We consider the general problem of resource sharing in societal networks, consisting of interconnected
communication, transportation, energy and other networks important to the functioning of society. Participants
in such networks need to take decisions daily, both on the quantity of resources to use and on the periods
of usage. With this in mind, we discuss the problem of incentivizing users to behave in such a way that
society as a whole benefits. In order to perceive societal level impact, such incentives may take the form of
rewarding users with lottery tickets based on good behavior, and periodically conducting a lottery to translate
these tickets into real rewards. We pose the user decision problem as a mean field game (MFG),
and the incentives question as one of trying to select a good mean field equilibrium (MFE). In such a framework,
each agent (a participant in the societal network) takes a decision based on an assumed distribution
of actions of his/her competitors, and the incentives provided by the social planner. The system is said to be
at MFE if the agent's action is a sample drawn from the assumed distribution. We show the existence
of such an MFE under different settings, and also illustrate how to choose an attractive equilibrium, using
demand response in energy networks as an example.

Additional Key Words and Phrases: Mean field games, societal networks, lottery scheme, threshold policy, 
smart grid 


1. INTRODUCTION 

There has recently been much interest in understanding societal networks, consisting 
of interconnected communication, transportation, energy and other networks that are 
important to the functioning of human society. These systems usually have a shared 
resource component, and participants have to periodically take decisions on when and 
how much to utilize such resources. Research into these networks often takes the form 
of behavioral studies on decision making by the participants, and whether it is possible 
to provide incentives to modify their behavior in such a way that the society as a whole
benefits [Merugu et al. 2009; Prabhakar 2013].

Our candidate application in this paper is that of a Load Serving Entity (LSE) or
a Load Aggregator (LA) (e.g., a utility company) trying to reduce its exposure to daily
electricity market volatility by incentivizing demand response in a Smart Grid setting.
The reason for our choice is the ready availability of data and reliable models
for the cost and payoff structure, which enables a realistic study. For instance, consider
Figure 1, which shows the (wholesale) price of electricity at different hours of day during
the summer months in Texas. The data was obtained from the Electric Reliability
Council of Texas [ERCOT 2014], an organization that manages the deregulated wholesale
energy market in the state. The price shows considerable variation during the
day, and peaks at about 5 PM, which is the time at which maximum demand occurs. A
major source of this demand in Texas is air conditioning, which in each home is of the
order of 30 kWh per day [Pecan-Street 2014]. Incentivizing customers to move a few
kWh of peak-time usage to the sides of the peak each day could lead to much reduced
risks of peak price borne by the LSE. Such demand shaping could also have a positive
effect on the environmental impact of power plant emissions, since supplying peak load is
associated with inefficient electricity generation.
[Plot: Day-ahead Prices Distribution]

Fig. 1. Day-ahead electricity market prices in dollars per MWh on an hourly basis from 12 AM to 12
PM, measured during June-August, 2013 in Austin, TX. Standard deviations above and below the mean
are indicated separately.

As an example, we take the baseline temperature setpoint as 22.5°C, and consider 
a customer who every day increases the setpoint by 1°C during 5-6 PM and decreases
the setpoint by 0.5°C in the off-peak times. We will see later that even such a small 
change of the setpoint of the AC can yield substantial increases in the utility of the 
participating users, which in our case is a group of fifty homes. In fact, we will also 
see that with the same small changes the incentives to the users can be tuned to 
achieve different trade-offs, either just a utility increase for the users, just savings 
to the LSE (of the order of eighty dollars in our case) or any objective that chooses 
some appropriate mix of the two. This result is under the implicit assumption that the 
LSE in question is a price-taker so that changes in its demand profile are assumed 
not to perturb the prices. The shifting of daily energy usage could potentially cause a 
small increase in the mean and deviation of the internal home temperature, which is a 
discomfort cost borne by the customer. In our system (that we are currently developing, 
and whose ideal performance we wish to analyze in this paper), the LSE awards a 
number of “Energy Coupons” to the customer in proportion to his usage at the
non-peak times, and these coupons are used as tickets for a lottery conducted by the LSE.
A higher number of coupons would be obtained by choosing an option that potentially 
entails more discomfort, and would also imply a higher probability of winning at the 
lottery. Since the customers do not observe the impact of day-ahead prices on a day-to- 
day basis nor do they see the aggregate demand at the LSE, the lottery scheme serves 
as a mechanism to transfer some of this information over to the customers by coupling 
them. We will explicitly demonstrate the advantage of this coupling over an individual 
incentive scheme (a fixed reward for peak time reductions) that serves as a benchmark 
for the comparison. 

In our analytical model, each agent has a set of actions that it can take in each play 
of a repeated game, with each action having a corresponding cost. Higher cost actions 
yield a higher number of coupons. At the end of each play, the agents participate in 
a lottery in which they are randomly permuted into groups, and one or more prizes 
are given in each group. The state of each agent is measured using his surplus, which 
captures the history of plays experienced by the agent, and is a proxy to capture his 
interest in participating in the incentive system. Each win at the lottery increases 
the surplus, and each loss decreases it. Furthermore, we assume that the agent has a 
prospect incremental utility function that is increasing, concave for positive surplus,
and convex for negative surplus. This prospect theory model captures decision
making under risk and uncertainty for agents. Any agent could depart from the system
with a fixed probability, and a departing agent is replaced by a new entrant with
a randomly drawn surplus. The main question we answer in this paper is then: how
would agents decide on what action to take at each play? Having answered this, we
also comment on its impact on the sum total value of the agents, the return to
the system, and the trade-off between these two quantities provided by our proposed
scheme.

1.1. Prospect Theory 

Most previous studies account for uncertainty in agent payoffs by means of expected
utility theory (EUT). Under EUT, the objective of the decision maker is to maximize the
probabilistically weighted average of utilities under different outcomes. However, EUT
does not incorporate observed behavior of human agents, who can take decisions deviating
from this conventional norm. For example, empirical studies have shown that
agents ascribe higher weights to rare, positive events (such as winning at a lottery)
[Kahneman and Tversky 1979].

Prospect theory (PT) [Kahneman and Tversky 1979; Tversky and Kahneman 1992;
Tversky and Kahneman 1981; Kahneman and Tversky 1984] is perhaps the most
well-known alternative theory to EUT. It was originally developed for binary lotteries
[Kahneman and Tversky 1979] and later refined to deal with issues related to
multiple outcomes and valuations [Tversky and Kahneman 1992]. This
Nobel-prize-in-economics-winning theory has been observed to provide a more accurate description
of decision making under risk and uncertainty than EUT. There are three key
characteristics of PT. First, the value function is concave for gains, convex for losses,
and steeper for losses than for gains. This feature is due to the observation that most
decision makers prefer avoiding losses to achieving gains. Thus, the value function is
usually S-shaped. Second, a nonlinear transformation of the probability scale is in effect,
i.e., decision makers will overweight low probability events and underweight high
probability events. The weighting function usually has an inverted S-shape, i.e., it is
steepest near the endpoints and shallower in the middle of the range, which captures the
behaviors related to risk seeking and risk aversion. Third, the framing effect
is accounted for, i.e., the decision maker takes into account the relative gains or
losses with respect to a reference point rather than the final asset position. As PT fits
reality better than EUT according to many empirical studies, it has been widely used in
many contexts such as social sciences [Gao et al. 2010; Harrison and Rutström 2009],
communication networks [Li and Mandayam 201^; Clark et al. 2002; Yu et al. 2014]
and smart grids [Wang et al. 2014; Xiao et al. 2015]. Since we study equilibria that
arise through agents' repeated play in lotteries, we use prospect theory as opposed to
expected utility theory to account for agent-perceived value while taking decisions.

1.2. Mean Field Games 

The problem described is an example of a dynamic Bayesian game with incomplete
information, wherein each player has to estimate the actions of all his potential opponents
in the current lottery (and in the future) without knowing their surpluses,
play a best response, and update his beliefs about their states of surplus based on
the outcome of the lottery. However, since the set of agents is large and, from the perspective
of each agent, each lottery is conducted with a randomly drawn finite set of
opponents, an accurate approximation for any agent is to assume that the states of
his opponents (and hence their actions) are independent of each other. This is the setting
of a Mean Field Game (MFG) [Lasry and Lions 2007; Jovanovic and Rosenthal 1988;
Huang et al. 2006], which we will use as a framework to study equilibria in societal
networks. Here, the system is viewed from the perspective of a single agent, who assumes
that each opponent's action would be drawn independently from an assumed
distribution, and plays a best response action. We say that the system is at a Mean
Field Equilibrium (MFE) if this best response action turns out to be a sample drawn
from the assumed distribution. We will use such an MFG to model dynamics in
societal networks.

In our analysis, we prove that an MFE exists in our system regardless of the exact
form of the utility function. In our numerical study, we use the prospect utility
function, which lets us observe further properties of the value function that follow
from this special form.


1.3. Demand Response in Deregulated Markets 

Demand Response is the term used to refer to the idea of customers being incentivized 
in some manner to change their normal electricity usage patterns in response to peaks
in the wholesale price of electric power [Albadi and El-Saadany 2008]. Many methods
of achieving demand response exist, including an extreme one of turning off power for 
short intervals to customers a few times a year if the price is very high. In any method 
of achieving demand response, customers expect a subsidy in return, often in terms of 
a reduced electricity bill. 

The idea is particularly relevant in deregulated electricity markets that exist in 
several US states, such as in Texas, wherein the firm that serves customer demand 
might have no infrastructure of its own, and merely buys on the wholesale market and 
sells to the home consumer. Customers have a choice between many different LSEs 
that they can obtain service from. For instance, many urban neighborhoods in Texas 
are served by 5-10 LSEs, and customers can periodically choose to sign contracts of 1
to 36 months with them [PowerToChoose 2016].


1.4. Main Results 

Our objective in this paper is to design a system that would incentivize the convergence
of user action profiles to one that would result in large savings to the LSE. Our
contributions in this paper towards such an objective are as follows: 

(1) We propose a mean field model to capture the dynamics in societal networks. 
Our model is well suited to large scale systems in which any given subset of agents 
interact only rarely. This kind of system satisfies a chaos hypothesis that enables us 
to use the mean field approximation to accurately model agent interactions. The state 
of the mean field agent is his surplus, which forms a Markov process that increases 
by winning and decreases by losing at the lottery. Our mean field model of societal 
networks is quite general, and can be applied to different incentive schemes that 
are currently being proposed in the field of public transportation and communication 
network usage. 

(2) We develop a characterization of a lottery in which multiple rewards can be 
distributed, but with each participant getting at most one by withdrawing the winner 
in each round. Each lottery is played amongst a cluster of M agents drawn from a 
random permutation of the set of all agents. While the exact form of the lottery is not 
critical to our results, we present it for completeness. 

(3) We characterize the best response policy of the mean field agent, using a dynamic 
programming formulation. We find that under our assumptions the value function is 
continuous in the action distribution. Further, we show using this result that given 
our ordering in which higher cost actions result in a higher probability of winning 
the lottery (due to more coupons being given), the choice of one action versus another 
depends on thresholds in the surplus, i.e., we obtain a threshold policy for the action 
choices. 

(4) The probability of winning the lottery defines the transition kernel (along with the 
regeneration distribution) of the Markov process of the surplus, and hence maps an
assumed distribution across competitors' states to a resultant stationary distribution.
We show the existence of a fixed point of this kernel, which is the MFE, by using 
Kakutani’s fixed point theorem. Our proof of the existence of MFE does not depend 
on the shape of the utility function, which is quite general. Since we have a discrete 
action and state space, showing a fixed point in the space of stationary distributions 
is quite intricate. 

(5) We conduct a simulation analysis of our proposed scheme using an accurate model 
of the daily usage of electricity in each hour, using available measurements over 
several months in Texas. We also use the data on wholesale electricity prices during 
the interval to calculate what times of day would yield the best returns to rewards. 
We show that if customers are willing to change the AC setpoint by as little as 1°C
each day, then each week the LSE gains a benefit of the order of $80 over a cluster of
50 homes, with about a $20 weekly reward.

(6) We conduct comparative studies between a benchmark scheme that returns a fixed 
reward per action to a customer versus the lottery scheme, and show that the lottery 
scheme can outperform the fixed reward scheme by about 100% in terms of total value 
to the users, and about 20% in terms of profit to the LSE. Furthermore, we also explore 
the relation between LSE profit and user value for both schemes, and show that as 
one changes the reward values, the lottery scheme bounds the achievable region of the 
benchmark scheme in a Pareto-sense: it is better able to attain a desired combination 
of user value and LSE profit, and includes many combinations that the individual 
incentive scheme cannot achieve. 


A 2-page conference abstract that includes a high-level overview of the results
developed herein was presented to practitioners in [Li et al. 2015].


1.5. Related Work 

In terms of the mean field game, our framework is based on work such as [Iyer et al.
2011; Manjrekar et al. 2014; Li et al. 2016a]. In [Iyer et al. 2011] the setting is that
of advertisers bidding for spots on a webpage, and the focus is on learning the value
of winning (making a sale through the advertisement) as time proceeds. In [Manjrekar
et al. 2014], apps on smart phones bid for service from a cellular base station, and
the goal is to ensure that the service regime that results has low per-packet delays.
In both works, the existence of an MFE with desired properties is proved. In [Li et al.
2016a], the objective is to incentivize truthful revelation of state that would allow for
optimal resource allocation in a device-to-device wireless network. The state space is
discrete, and the focus is on exploration of truthful dynamic mechanisms. Unlike the
previous work, all of which focused on pure strategy equilibria, our current work has
a more complex state space, and pure strategy equilibria may not exist due to the
non-uniqueness of best responses. Hence, we seek a mixed strategy equilibrium, which
necessitates a different proof technique.

Nudge systems are typically designed and used to encourage socially beneficial
behaviors and individually beneficial behaviors. For instance, lottery schemes are widely
used in practice to incentivize good behavior, e.g., to combat (sales) tax evasion in
Brazil ([Naritomi 2013]), Portugal ([Poco et al. 2015]), Taiwan ([Chen and Wang 2010]),
and for Internet congestion management ([Loiseau et al. 2014]). Similarly, [Merugu
et al. 2009; Prabhakar 2013] provide experimental results on designing lottery based
“nudge engines” to provide incentives to participants to modify their behaviors in the
context of evenly distributing load on public transportation. In another scheme, [Allcott
and Kessler 2015] study the impact of nudging on social welfare by sending
one-year home energy reports to participants and using multiple price lists to determine
participants' willingness to stay in the system for the next year. Our system is a form
of nudge engine, but our focus is on analytical characterization of system behavior and
attained equilibria with a large number of customers with repeated decision-making. We
aim to design incentive schemes to modify customer behavior such that the system as
a whole benefits from the attained equilibrium.

Our idea of offering coupons for electricity usage at certain times of day is based
on one presented in [Zhong et al. 2013], which suggests offering such incentives to
coincide with the predicted realtime price peaks. An experimental trial based on a
similar idea is described in [Bitar 2015], in which the focus is on designing algorithms
to coordinate demand flexibility to enable the full utilization of variable renewable
generation. In [Hao and Xie 2014], this kind of system is modeled as a Stackelberg
game with two stages: setting the coupon values followed by consumer choice. The
decision making model in all the above research is myopic. [Schwartz et al. 2012] study
demand-response as trading off the cost of an action (such as modifying energy usage)
against the probability of winning at a lottery in terms of a mean field game. However,
the game is played in a single step according to their model, and there is no evolution
of state or dynamics based on repeated play. Further, their conception of the mean
field equilibrium is that the mean value of the action distribution (not the distribution
itself) is invariant. Unlike these models, we are interested in characterizing repeated
consumer choice with state evolution when the number of customers is large, and
identifying the action distribution and benefits (if any) of the resulting equilibrium.

A rich literature studies lottery schemes, and here we can only hope to cover the
fraction of it that we see as most relevant. In this paper, we model lotteries as choosing
a random permutation of the M agents participating in it, and picking the first K
of them as winners, with the distribution on the symmetric group of permutations of
\(\{1, \dots, M\}\) being a function of the coupons assigned to the different actions. Assuming
that different actions yield different numbers of coupons, we will choose the distribution
such that more coupons result in a higher probability of winning. There are various
probabilistic models on permutations in the ranking literature [Qin et al. 2010;
Lozano and Irurozki 2012]. Here we use the popular Plackett-Luce model [Hunter
2004] to implement our lotteries. While the Plackett-Luce model is used for concreteness,
other probabilistic models on permutations such as the Thurstone model [Lozano
and Irurozki 2012] can also be used with the number of coupons as parameters of the
distribution, as long as more coupons result in a higher probability of winning.


1.6. Organization 


The paper is organized as follows. In Section 2 we introduce our mean field model. In
Section 3 we develop a characterization of a lottery in which multiple rewards can be
distributed, but with each participant getting at most one by withdrawing the winner
in each round. The subsequent sections discuss, in order, the basic property of the
optimal value function, the existence of an MFE, and the best response
policy of the mean field agent, which we characterize using a dynamic programming formulation.
We then conduct simulation-based numerical studies on utilizing our
framework in the context of demand response in electricity markets, and conclude.
To ease exposition of our results, all proofs are relegated to the Appendix.


2. MEAN FIELD MODEL 

We consider a general model of a societal network in which the number of agents is 
large. Agents have a discrete set of actions available to them, and must take one of 
these actions at each discrete time instant. The actions result in the agents receiving
coupons, with higher cost actions resulting in more coupons. The agents are then
randomly permuted into clusters of size M, and a nudge is provided via a lottery that
is held using the coupons to win real rewards. Thus, agents must take their actions
under some belief about the likely actions, and hence the likely coupons held by their
competitors in the lottery.

Figure 2 illustrates the mean field approximation of our model, which is an accurate
representation of the Bayesian system when the number of agents is large [Graham
and Méléard 1994; Iyer et al. 2011]. The diagram is drawn from the perspective of a
single agent (w.l.o.g., let this be agent 1), who assumes that the actions played by each
of his opponents would be drawn independently of each other from the probability
mass function \(\rho\). In this section, we will introduce the notation, costs and payoffs of
the agent, and provide a brief description of the policy space and equilibrium.


[Figure: block diagram with tiles labeled Assumed Action Distribution, Utility function, Probability of winning, Surplus Distribution, and Regeneration Distribution]

Fig. 2. Mean Field Game.

Time: Time is discrete and indexed by \(k \in \{0, 1, \dots\}\).

Agents: As discussed above, the total number of agents is infinite, and in the MFG, we
consider a generic agent 1 who in each lottery will be paired with \(M - 1\) other agents
drawn randomly from the infinite population.

Actions: We suppose that each agent has the same action space, denoted as \(\mathcal{A} =
\{1, 2, \dots, |\mathcal{A}|\}\). Hence, the action that this agent takes at time \(k\) is \(a[k] \in \mathcal{A}\). Under
the mean field assumption, the actions of the other agents would be drawn independently
from the p.m.f. \(\rho = [\rho_1, \rho_2, \dots, \rho_{|\mathcal{A}|}]\), where \(\rho_a\) is the probability mass associated
with action \(a\). We call \(\rho\) the assumed action distribution.

Costs: Each action \(a \in \mathcal{A}\) taken at time \(k\) has a corresponding cost \(\theta_a\). This cost is fixed
and represents the discomfort suffered by the agent in having to take that action.
Coupons: When an agent takes an action \(a\), it is awarded some fixed number of coupons
\(r_a\) for playing that action. These coupons are then used by the agents as lottery tickets.
Lottery: We suppose that there are only \(K\) rewards for agents in one cluster, where \(K\)
is a fixed number less than \(M\). The probability of winning is based on the number of
coupons that each agent possesses. We model each lottery as choosing a permutation
of the \(M\) agents participating in it, and picking the first \(K\) of them as winners.

States: The agent keeps track of his history of wins and losses in the lotteries by
means of his net surplus at time \(k\), denoted as \(x[k]\). The value of surplus is the state of
the agent, and is updated in a Markovian fashion as follows:

\[
x[k+1] = \begin{cases} x[k] + w, & \text{if agent 1 wins the lottery}, \\ x[k] - l, & \text{if agent 1 loses the lottery}, \end{cases} \qquad (1)
\]

where \(w\) and \(l\) are the impacts of winning and losing on surplus, respectively. Effectively, the assumption
is that the agent expects to win at least an amount \(l\) at each lottery. Not receiving this
amount would decrease his surplus. Similarly, if the prize money at the lottery is \(w + l\),
the increase in surplus due to winning is \(w\). Surplus values are discrete, and the set of
possible values is given by a countable set \(\mathcal{X}\) that ranges over \((-\infty, +\infty)\).
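
As a small illustration of the update in (1), the sketch below traces a surplus path over a few lottery outcomes. The function name `next_surplus` and the values w = 4 and l = 1 are illustrative assumptions, not quantities taken from our experiments.

```python
def next_surplus(x: float, won: bool, w: float, l: float) -> float:
    """One step of the surplus update: gain w on a win, lose l on a loss."""
    return x + w if won else x - l


# Illustrative values only (assumed for this sketch): w = 4, l = 1.
path = [0.0]
for won in [False, False, True, False]:
    path.append(next_surplus(path[-1], won, w=4.0, l=1.0))

print(path)  # [0.0, -1.0, -2.0, 2.0, 1.0]
```

Two losses followed by one win leave the agent with a positive surplus here only because the assumed w exceeds 2l; other choices of w and l change the balance.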

Value function for prospect: The impact of surplus on the agent's happiness is modeled
by an S-shaped incremental utility function \(u(x[k])\), which is monotone increasing,
concave for a positive surplus and convex for a negative surplus. Moreover, the impact
of loss is usually larger than that of gain of the same absolute value. Following [Tversky
and Kahneman 1992], we use the value function for prospect

\[
u(x) = \begin{cases} u_+(x) = x^{\gamma}, & x \ge 0, \\ u_-(x) = -\varphi\,(-x)^{\gamma}, & x < 0, \end{cases} \qquad (2)
\]

where \(\varphi > 1\) is the loss penalty parameter and \(0 < \gamma < 1\) is the risk aversion parameter.
A larger \(\varphi\) means that the operator is more loss averse, while a smaller \(\gamma\) (i.e., the
value function is more concave) indicates that the operator is more risk seeking. From
past empirical studies [Tversky and Kahneman 1992; Kahneman and Tversky 1984],
realistic values are \(\varphi = 2.25\) and \(\gamma = 0.88\).
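
The following is a minimal sketch of the prospect value function, assuming the standard Tversky-Kahneman form \(u(x) = x^{\gamma}\) for \(x \ge 0\) and \(u(x) = -\varphi(-x)^{\gamma}\) for \(x < 0\). The function and parameter names (`prospect_value`, `loss_penalty`, `gamma`) are hypothetical; the defaults are the empirical values quoted above.

```python
def prospect_value(x: float, loss_penalty: float = 2.25, gamma: float = 0.88) -> float:
    """S-shaped prospect value: concave for gains, convex and steeper for losses."""
    if x >= 0:
        return x ** gamma
    # Losses are scaled by the loss penalty (assumed > 1), so they weigh more than gains.
    return -loss_penalty * (-x) ** gamma


for surplus in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(surplus, prospect_value(surplus))
```

With these defaults a loss of a given size is valued 2.25 times as heavily as a gain of the same size, which is the loss aversion property described in the text.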

Weighting function for prospect: In earlier studies [Prelec 1998], it has been observed
that people tend to subjectively weight uncertain outcomes in real-life decision
making. In the proposed game, these weighting factors capture the agent's subjective
evaluation of the mixed strategy of its opponents. Thus, under PT, instead of
objectively observing the probability of winning the lottery \(p_{\rho,a}\), each user perceives a
weighted version of it, \(\phi(p_{\rho,a})\). Here, \(\phi(\cdot)\) is a nonlinear transformation that maps the
objective probability to a subjective one, and is monotonically increasing in probability.
It has been shown in many PT studies that people usually overweight low probability
outcomes and underweight high probability outcomes. Following [Prelec 1998], we use
the weighting function

\[
\phi(p) = \exp\left(-(-\ln p)^{\xi}\right), \quad \text{for } 0 < p < 1, \qquad (3)
\]

where \(\xi \in (0, 1]\) is the objective weight that characterizes the distortion between
subjective and objective probability. Note that under the extreme case of \(\xi = 1\), \(\phi\) reduces
to the conventional EUT probability, i.e., \(\phi(p) = p\).
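
A short sketch of the Prelec weighting function may help make the distortion concrete. The name `prelec_weight`, the exponent argument `xi`, its default of 0.7, and the handling of the endpoints p = 0 and p = 1 are assumptions made for illustration only.

```python
import math


def prelec_weight(p: float, xi: float = 0.7) -> float:
    """Prelec weighting exp(-(-ln p)^xi); xi = 1 recovers the objective probability."""
    if p <= 0.0:
        return 0.0  # endpoint convention assumed for this sketch
    if p >= 1.0:
        return 1.0  # endpoint convention assumed for this sketch
    return math.exp(-((-math.log(p)) ** xi))


for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, prelec_weight(p))
```

For any exponent in (0, 1) the function crosses the diagonal at p = 1/e, overweighting probabilities below that point and underweighting those above it, which matches the inverted S-shape described in Section 1.1.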

Regeneration: We assume that an agent may choose to quit the system at any time. 
This event occurs with probability \(1 - \beta\), where \(\beta \in (0, 1)\). When this happens, a new
agent takes the place of the old one, and his state is drawn from a probability mass
function \(\Psi\).

Best Response Policy: The agent must choose an action at each time, where
staying with the status-quo/baseline also counts as an action. The green/light tiles in
Figure 2 relate to the problem of the agent determining his best response policy. The
agent assumes that the actions taken by each of his \(M - 1\) opponents are drawn
independently from the probability mass function \(\rho\). Given this assumption, and given that
his surplus is \(x\) and his current utility is \(u(x)\), the agent must calculate the probability
of winning at the lottery \(p_{\rho,a}(x)\) if he were to take action \(a(x) \in \mathcal{A}\), incurring a
cost \(\theta_{a(x)}\) and gaining \(r_{a(x)}\) coupons. Since the agent must take this decision repeatedly,
he must solve a dynamic program to determine his optimal policy. There could
be many best response actions, and we assume that the agent chooses a randomized
policy \(\sigma(x) = [\sigma_1(x), \sigma_2(x), \dots, \sigma_a(x), \dots, \sigma_{|\mathcal{A}|}(x)]\), in which \(\sigma_a(x)\) specifies the
probability of playing action \(a\) when the agent's surplus is \(x\); in other words, we enlarge the
space to include mixed strategies, as pure strategy equilibria may not exist. The action
taken by the agent is a random variable \(A \sim \sigma(x)\). The details of the lottery and how
to calculate the probability of success are given in Section 3. The properties of the best
response policy are described in detail in a later section.

Stationary Surplus Distribution: The assumed action distribution \(\rho\) and the
best-response randomized policy \(\sigma(x)\) yield the state transition kernel of the Markov chain
corresponding to the surplus, via the probability of winning the lottery \(p_{\rho,a}(x)\). This is
illustrated by means of the blue/dark tiles in Figure 2. The transition kernel is also
influenced by the regeneration distribution \(\Psi\). The stationary distribution of surplus
associated with the transition kernel is denoted as \(\zeta_{\rho}\). This stationary distribution of
the single mean field agent is equivalent to the one-step empirical state distribution of
infinite agents who all take a (mixed-strategy) action \(\sigma(x)\) when the state is \(x\), assuming
that the actions of their competitors would be drawn from \(\rho\).
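
To make the role of the transition kernel concrete, the sketch below computes a stationary surplus distribution by power iteration. It rests on several assumptions that are not part of the model above: the surplus is truncated to a finite integer grid and clipped at the boundaries, an agent survives each step with probability beta and otherwise regenerates from `regen`, and `win_prob` is a hypothetical state-dependent winning probability standing in for the probability induced by the policy and the assumed action distribution.

```python
def stationary_surplus(states, win_prob, w, l, beta, regen, iters=2000):
    """Power iteration for the stationary law of a truncated surplus chain."""
    index = {x: i for i, x in enumerate(states)}
    lo, hi = states[0], states[-1]
    dist = list(regen)
    for _ in range(iters):
        # With probability 1 - beta the agent is replaced by a regenerated one.
        new = [(1.0 - beta) * q for q in regen]
        for x, mass in zip(states, dist):
            p = win_prob(x)
            up = index[min(x + w, hi)]      # win: surplus rises by w (clipped)
            down = index[max(x - l, lo)]    # loss: surplus falls by l (clipped)
            new[up] += beta * mass * p
            new[down] += beta * mass * (1.0 - p)
        dist = new
    return dist


def win_prob(x):
    """Hypothetical: low-surplus agents play costlier actions and win more often."""
    return 0.3 if x < 0 else 0.15


states = list(range(-10, 11))
regen = [1.0 / len(states)] * len(states)
dist = stationary_surplus(states, win_prob, w=4, l=1, beta=0.9, regen=regen)
print(round(sum(dist), 6))
```

Because regeneration occurs with probability 1 - beta at every step, the iteration contracts at rate beta, so a few hundred iterations already suffice for the assumed beta = 0.9.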

Mean Field Equilibrium: The triple of an assumed action distribution \(\rho\), randomized
policy \(\sigma\) and stationary surplus distribution \(\zeta\) gets mapped into a triple of action
distribution \(\rho\), best-response randomized policy \(\sigma\) and a stationary surplus distribution
\(\zeta\) via the operations described above. A fixed point of the resulting map is called an
MFE. For a formal definition and details of the proof of existence, see the later section on the existence of MFE.


3. LOTTERY SCHEME 

We first construct the lottery scheme that will be used in our mean field game. We 
permute all the agents into clusters, such that there are exactly M agents in each 
cluster, and conduct a lottery in each such cluster. Suppose there are K rewards for 
all agents in one cluster, where K is a fixed number less than M. When an agent
takes an action, he/she will receive the credit (number of coupons) associated with that 
action. Then the probability of winning is based on the number of coupons that each 
agent possesses. We will model the lotteries as choosing a permutation of the M agents 
participating in it, and picking the first K of them as winners. Then different lottery 
schemes can be interpreted as choosing different distributions on the symmetric group 
of permutations on M. In particular, we will use ideas from the Plackett-Luce model 
to implement our lotteries. 

Without loss of generality, we assume that the actions are ordered in decreasing
order of the costs so that θ_1 > ⋯ > θ_{|A|}. In order to incentivize agents to take the
more costly actions we will insist that the vector of coupons obtained for each action is
also in decreasing order of the index, i.e., r_1 > ⋯ > r_{|A|}.

The specific lottery procedure we consider is the following: for every agent m that
takes action a[m] and receives coupons r_{a[m]} > 0, we choose an exponential random
variable with mean 1/r_{a[m]} and then pick the first K agents in increasing order of the
realizations of the exponentials. Note the abuse of notation, only in this section, of using
a[m] to refer to the action of agent m. Since we consider only one lottery, we do not
consider time k. Let agent m = 1, …, M receive r_{a[m]} coupons. The outcome of the
lottery is a permutation over the agent indices, and we denote such a permutation
by σ = [σ_1, σ_2, …, σ_M]. We then have the probability of the permutation σ given by

\[
P(\sigma \mid r_{a[1]}, \dots, r_{a[M]}) = \prod_{n=1}^{M-1} \frac{r_{a[\sigma_n]}}{\sum_{j=n}^{M} r_{a[\sigma_j]}}. \tag{4}
\]

Essentially, after each agent is chosen as a winner, he/she is removed and the next
lottery is conducted just as before but with fewer agents.
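
The exponential-clock procedure above can be sketched in a few lines of Python. This is
only an illustrative sketch, not the implementation used in the paper: the function name
`draw_winners` and its arguments are hypothetical, and all coupon counts are assumed to
be strictly positive.

```python
import random


def draw_winners(coupons, num_prizes, rng=random):
    """Run one lottery and return the winning agent indices in draw order.

    coupons[m] is the (strictly positive) number of coupons of agent m.
    Hypothetical helper written for illustration only.
    """
    if not 0 < num_prizes <= len(coupons):
        raise ValueError("num_prizes must be between 1 and the number of agents")
    if any(r <= 0 for r in coupons):
        raise ValueError("every agent needs a positive number of coupons")
    # An exponential clock with rate r has mean 1/r, so agents holding more
    # coupons tend to have smaller realizations.
    clocks = sorted((rng.expovariate(r), m) for m, r in enumerate(coupons))
    # The first K agents in increasing order of their clocks win.
    return [m for _, m in clocks[:num_prizes]]
```

With two agents holding 3 and 1 coupons, the first agent should be drawn first in roughly
three quarters of the lotteries, in line with the first factor of the permutation
probability.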

We now analyze the probability of winning in our lottery. For analysis under the
mean field assumption, it suffices to consider agent 1, with the coupons it gets by taking
action a[1] being denoted as r_{a[1]}. Let ℳ := {2, …, M}, which is the set of opponents of
agent 1. For these agents, suppose there are v_n agents that choose action n, where
∑_{n∈A} v_n = M − 1. We denote the vector of these action counts by v = (v_1, …, v_{|A|}).
The conditional probability of agent 1 failing to obtain a reward is given by

\[
p^{L}_{v,a[1]} = \sum_{k_1 \in \mathcal{M}_1} \cdots \sum_{k_K \in \mathcal{M}_K} \prod_{l=1}^{K} \frac{r_{a[k_l]}}{r_{a[1]} + \sum_{m \in \mathcal{M}_l} r_{a[m]}},
\]

where L refers to the fact that agent 1 “loses,” ℳ_1 = ℳ, and for l ≥ 2 we have ℳ_l =
ℳ_{l−1} \ {k_{l−1}}. Essentially, the above looks at the lottery process round by round, and is
a summation of the probabilities of all permutations in which agent 1 does not appear
in the first spot in any round.

The above expression simplifies considerably if the summations are instead taken
over the actions k_l that the lottery winner at round l ∈ {1, …, K} can take. Note that
we assume that we can distinguish the actions based on the number of coupons given
out. If this were not true, then we could further simplify the expression by summing
over the coupon space. Given a coupon/action profile v, let J(v) denote the actions
that have non-zero entries. Additionally, for k ∈ J(v), let v − 1_k denote the
coupon profile obtained by removing one entry at location k, and let |v| denote the sum of
all the coupons in profile v, i.e., |v| = ∑_k r_k v_k. Then

\[
p^{L}_{v,a[1]} = \sum_{k_1 \in J(v^{1})} \cdots \sum_{k_K \in J(v^{K})} \prod_{l=1}^{K} \frac{v^{l}_{k_l}\, r_{k_l}}{r_{a[1]} + |v^{l}|}, \tag{5}
\]

where v^1 = v, for l = 2, …, K, v^l = v^{l−1} − 1_{k_{l−1}}, and v^l_k is the number of entries at
location k for coupon profile v^l. Note that p^L_{v,a[1]} is a decreasing function of r_{a[1]} for
every v. Therefore, agent 1 comparing two actions i and j that have r_i > r_j will find
p^L_{v,i} < p^L_{v,j} for all v. Also, by taking the limit of r_{a[1]} going to 0, having an action
with 0 coupons results in a loss probability of 1 for every v.
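
The round-by-round structure of the loss probability lends itself to a short recursion
over opponent profiles. The following Python sketch evaluates it under the reading that
each round contributes a factor v_k r_k / (r_{a[1]} + |v|) and then removes one opponent
from location k; the function name `loss_probability` and its argument names are
hypothetical.

```python
from functools import lru_cache


def loss_probability(own_coupons, coupons, counts, num_prizes):
    """Probability that agent 1 misses all K prizes, given the opponent profile.

    coupons[k] is r_k, counts[k] is v_k (number of opponents taking action k),
    own_coupons is r_{a[1]}. Illustrative helper, not the paper's code.
    """
    @lru_cache(maxsize=None)
    def lose(profile, rounds_left):
        if rounds_left == 0:
            return 1.0
        total = sum(r * v for r, v in zip(coupons, profile))  # |v|
        prob = 0.0
        for k, v_k in enumerate(profile):
            if v_k == 0:
                continue  # k is not in J(v)
            # One of the v_k opponents playing action k wins this round ...
            step = v_k * coupons[k] / (own_coupons + total)
            # ... and is removed before the next round.
            reduced = profile[:k] + (v_k - 1,) + profile[k + 1:]
            prob += step * lose(reduced, rounds_left - 1)
        return prob

    return lose(tuple(counts), num_prizes)
```

For a single prize the recursion collapses to |v| / (r_{a[1]} + |v|), and for three
identical agents competing for two prizes it returns 1/3, as symmetry requires.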

To determine the probability of winning in the lottery we need to account for the fact
that the actions of the opponents are drawn from the distribution ρ (under the mean
field assumption). Hence, the probability of obtaining the coupon profile (equivalently,
action profile) of the opponents v = (v_1, …, v_{|A|}) is given by the multinomial formula,
i.e.,

\[
P_{\rho}(v) = (M-1)! \prod_{i \in A} \frac{\rho_i^{v_i}}{v_i!}. \tag{6}
\]

Using (5) and (6), we obtain the winning probability for the mean field agent 1 when
taking action a as

\[
p_{\rho,a} = 1 - \sum_{v \,:\, \sum_{n \in A} v_n = M-1} p^{L}_{v,a}\, P_{\rho}(v). \tag{7}
\]
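
As a sanity check on the winning probability, one can also estimate it by simulation
instead of enumerating all opponent profiles. The sketch below is a Monte Carlo
illustration and not the exact computation: it samples the M − 1 opponent actions from ρ
and runs the exponential-clock lottery. The function name `estimate_win_probability`, the
trial count and the seed are assumptions made for this example, and every action is
assumed to carry a positive number of coupons.

```python
import random


def estimate_win_probability(own_action, rho, coupons, cluster_size,
                             num_prizes, trials=20000, seed=0):
    """Monte Carlo estimate of the probability that agent 1 wins a prize.

    rho[i] is the assumed probability that an opponent takes action i and
    coupons[i] > 0 is the number of coupons awarded for action i.
    """
    rng = random.Random(seed)
    actions = range(len(rho))
    wins = 0
    for _ in range(trials):
        # Opponents' actions are i.i.d. draws from rho (mean field assumption).
        opponents = rng.choices(actions, weights=rho, k=cluster_size - 1)
        own_clock = rng.expovariate(coupons[own_action])
        # Agent 1 is among the first K iff fewer than K opponents beat its clock.
        faster = sum(1 for a in opponents
                     if rng.expovariate(coupons[a]) < own_clock)
        if faster < num_prizes:
            wins += 1
    return wins / trials
```

When all actions carry the same number of coupons, the estimate should be close to K/M,
since every agent is then equally likely to be among the winners.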

By lower bounding each term in the conditional probability of not obtaining a reward
we get p_{ρ,a} ≤ 1 − (…) =: p̄_W ∈ (0, 1). If we ran the lottery without removing
the winners (and any of their coupons), we obtain a lower bound on the probability of
winning that has a simpler expression. Using this simpler expression we can obtain
the lower bound p_{ρ,a} ≥ (…) / (r_{|A|} + (M − 1) r_1) =: p̲_W ∈ (0, 1). Note that both
bounds are independent of ρ. If we allow an action that yields 0 coupons, then the above
bounds become trivial with p̄_W = 1 and p̲_W = 0.

An important feature of our lottery scheme is that the probability of winning
increases with the number of coupons given out. For simplicity we assumed a fixed
reward for any win. However, we can extend the lotteries to ones where different rewards
are given out at different stages, and also where the rewards are dependent on the
number of coupons of the winner. For the latter, we will insist on the rewards being an
increasing function of the number of coupons of the winner. Finally, we can also extend
to scenarios where we choose the number of stages K in an (exogenous) random fashion
in {1, …, M − 1}. Since the analysis carries through unchanged except with more
onerous notation, we only discuss the simplest setting.

4. OPTIMAL VALUE FUNCTION

As discussed in Section 2, the mean field agent must determine the optimal action to
take, given his surplus x and the assumed action distribution ρ. We follow the usual
quasi-linear combination of prospect function and cost consistent with Von Neumann-
Morgenstern utility functions, and under which the impact of winning or losing in the
lottery is on the surplus of the agent (and not simply a one-step myopic value change).
Thus, the dynamic program that the agent in prospect theory needs to solve is

\[
V_{\rho}(x) = \max_{a(x) \in A} \Big\{ u(x) - \theta_{a(x)} + \beta \big[ \phi(p_{\rho,a(x)})\, V_{\rho}(x+w) + \phi(1 - p_{\rho,a(x)})\, V_{\rho}(x-l) \big] \Big\}. \tag{8}
\]

Note that p_{ρ,a(x)} is a result of a lottery that we described in detail in Section 3, and
φ(·) is the weighting function, which overweights small probabilities (win the lottery)
and underweights moderate and high probabilities (lose the lottery). Here we use the
weighting function defined earlier.

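First, we need to define a set of functions as

\[
\Phi = \Big\{ f : X \to \mathbb{R} \;:\; \sup_{x \in X} \Big| \frac{f(x)}{\Omega(x)} \Big| < \infty \Big\},
\]

where Ω(x) = max{|u(x)|, 1}. Note that Φ is a Banach space with the Ω-norm

\[
\| f \|_{\Omega} = \sup_{x \in X} \Big| \frac{f(x)}{\Omega(x)} \Big| < \infty.
\]

Also define the Bellman operator T_ρ as

\[
T_{\rho} f(x) = \max_{a(x) \in A} \Big\{ u(x) - \theta_{a(x)} + \beta \big[ \phi(p_{\rho,a(x)})\, f(x+w) + \phi(1 - p_{\rho,a(x)})\, f(x-l) \big] \Big\}, \tag{9}
\]

where f ∈ Φ.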

We now show that the optimal value function V_ρ(x) exists and that it is continuous in ρ.

Lemma 4.1. 1) There exists a unique f* ∈ Φ such that T_ρ f*(x) = f*(x) for every
x ∈ X, and given x ∈ X, for every f ∈ Φ, we have T_ρ^n f(x) → f*(x) as n → ∞.

2) The fixed point f* of operator T_ρ is the unique solution of Equation (8), i.e., f* = V_ρ.

Lemma 4.2. The value function V_ρ(·) is Lipschitz continuous in ρ.
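
Lemma 4.1 suggests computing the value function by repeatedly applying the Bellman
operator. The Python sketch below does this on a truncated integer surplus grid. It is a
minimal illustration under several assumptions that are not part of the analysis above:
the state space is a finite contiguous range, the surplus is clamped at the boundaries of
that range, the weighting function is passed in as a callable, and the name
`value_iteration` and its arguments are hypothetical. Convergence of the loop also
presumes that β(φ(p) + φ(1 − p)) < 1 for the supplied weighting function.

```python
def value_iteration(states, utility, costs, win_prob, weight, w, l, beta,
                    tol=1e-9, max_iter=100000):
    """Iterate the Bellman operator on a truncated surplus grid.

    states: contiguous integer range of surplus values (a truncation).
    costs[a], win_prob[a]: cost and winning probability of action a.
    weight: probability weighting function; w, l: surplus gain and loss.
    Returns the approximate value function and a greedy policy.
    """
    lo, hi = states[0], states[-1]

    def clamp(x):
        # Assumption of this sketch: surplus saturates at the grid boundary.
        return min(max(x, lo), hi)

    def q_value(V, x, a):
        p = win_prob[a]
        return utility(x) - costs[a] + beta * (
            weight(p) * V[clamp(x + w)] + weight(1.0 - p) * V[clamp(x - l)])

    actions = range(len(costs))
    V = {x: 0.0 for x in states}
    for _ in range(max_iter):
        new_V = {x: max(q_value(V, x, a) for a in actions) for x in states}
        gap = max(abs(new_V[x] - V[x]) for x in states)
        V = new_V
        if gap < tol:
            break
    policy = {x: max(actions, key=lambda a: q_value(V, x, a)) for x in states}
    return V, policy
```

As a quick check, with a constant utility of 1, a single zero-cost action, an identity
weighting function and β = 0.5, the iteration converges to 1/(1 − β) = 2 at every state.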


4.1. Stationary distributions 

For a generic agent, w.l.o.g., say agent 1, we consider the state process {x_1[k]}_{k=0}^∞.
It is a Markov chain with countable state space X, and it has an invariant transition
kernel given by a combination of the randomized policy σ(x) at each surplus x for any
a(x) ∈ A, and the lottery scheme from Section 3. By following this Markov policy, we
get a process {W[k]}_{k=0}^∞ that takes values in {win, lose}, with probability p_{ρ,a(x)} for
the win, drawn conditionally independent of the past (given x_1[k]). Then the transition
kernel conditioned on W[k] is given by

\[
P(x_1[k] \in B \mid x_1[k-1] = x, W[k]) = \beta \big( 1_{\{W[k] = \mathrm{win}\}} 1_{\{x+w \in B\}} + 1_{\{W[k] = \mathrm{lose}\}} 1_{\{x-l \in B\}} \big) + (1-\beta)\, \Psi(B), \tag{10}
\]

where B ⊆ X and Ψ is the probability measure of the regeneration process for surplus.
The unconditioned transition kernel is then

\[
P(x_1[k] \in B \mid x_1[k-1] = x) = \beta \sum_{a(x) \in A} \sigma_{a}(x)\, p_{\rho,a(x)}\, 1_{\{x+w \in B\}}
 + \beta \sum_{a(x) \in A} \sigma_{a}(x)\, (1 - p_{\rho,a(x)})\, 1_{\{x-l \in B\}} + (1-\beta)\, \Psi(B). \tag{11}
\]

Lemma 4.3. The Markov chain where the action policy is determined by σ(x) based
on the states of the users and the transition probabilities in (11) is positive recurrent
and has a unique stationary surplus distribution. We denote the unique stationary
surplus distribution as ζ_{ρ×σ}. Let ζ^{(k)}_{ρ×σ}(B|x) be the surplus distribution at time k
induced by the transition kernel (11) conditioned on the event that X[0] = x and there is
no regeneration until time k. ζ_{ρ×σ}(·) and ζ^{(k)}_{ρ×σ}(·|x) are related as follows:

\[
\zeta_{\rho \times \sigma}(B) = \sum_{k=0}^{\infty} (1-\beta)\beta^{k}\, E_{\Psi}\big( \zeta^{(k)}_{\rho \times \sigma}(B \mid X) \big)
 = \sum_{k=0}^{\infty} (1-\beta)\beta^{k} \int \zeta^{(k)}_{\rho \times \sigma}(B \mid x)\, d\Psi(x). \tag{12}
\]

Thus the expression of ζ_{ρ×σ}(B) in terms of ζ^{(k)}_{ρ×σ}(B|x) is simply based on the
properties of the conditional expectation. Note that in E_Ψ(ζ^{(k)}_{ρ×σ}(B|X)), the random
variable X is the initial condition of the surplus, generated by Ψ. For x ∈ X, the only
possible one-step updates are the increase of the surplus to x + w or a decrease to x − l,
i.e., B = {x + w, x − l}.
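
The stationary surplus distribution can be approximated numerically by iterating the
unconditioned kernel: with probability β the surplus moves up by w or down by l, and with
probability 1 − β it is redrawn from Ψ. The following Python sketch does so on a truncated
integer grid. The name `stationary_surplus`, the clamping at the grid boundaries and the
fixed number of iterations are assumptions of this illustration only.

```python
def stationary_surplus(states, sigma, win_prob, w, l, beta, psi, iters=500):
    """Approximate the stationary surplus distribution by power iteration.

    states: contiguous integer range (a truncation of the state space).
    sigma[x][a]: probability of taking action a at surplus x.
    win_prob[a]: winning probability of action a.
    psi: dict mapping surplus values to regeneration probabilities.
    """
    lo, hi = states[0], states[-1]

    def clamp(x):
        # Assumption of this sketch: surplus saturates at the grid boundary.
        return min(max(x, lo), hi)

    zeta = {x: psi.get(x, 0.0) for x in states}
    for _ in range(iters):
        # Regeneration: with probability 1 - beta the surplus is redrawn from psi.
        nxt = {x: (1.0 - beta) * psi.get(x, 0.0) for x in states}
        for x in states:
            mass = zeta[x]
            if mass == 0.0:
                continue
            p_win = sum(sigma[x][a] * win_prob[a] for a in range(len(win_prob)))
            nxt[clamp(x + w)] += beta * mass * p_win          # win: surplus up by w
            nxt[clamp(x - l)] += beta * mass * (1.0 - p_win)  # lose: surplus down by l
        zeta = nxt
    return zeta
```

If every lottery is won, Ψ is a point mass at 0 and w = 1, the result is the geometric law
(1 − β)β^k on surplus k, which matches the series representation of the stationary
distribution.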


5. MEAN FIELD EQUILIBRIUM 

The action distribution ρ is a probability mass function on the action set A: let ρ_i be
the probability of choosing action i. Note that ρ lives in the probability simplex on
ℝ^{|A|}, which is compact and convex; denote it as Γ_ρ. Let ζ be the stationary surplus
distribution; the set of all such possible surplus distributions is denoted as Γ_ζ,
which is a compact and convex subset of l_∞. For a given surplus x, let σ(x) be the
action distribution at x. Denote by Γ_σ the set of all possible distributions over the
action space for each x, which is compact and convex. We further assume that ρ ∈ Γ_ρ,
ζ ∈ Γ_ζ and σ(x) ∈ Γ_σ for each x.

Definition 5.1. Consider the action distribution ρ, the randomized policy σ and the
stationary surplus distribution ζ_ρ. (i) Given the action distribution ρ, determine the
success probabilities in the lottery scheme using (7) and then compute the value
function in (8). Taking the best response given by (8) results in an action distribution σ;
(ii) Given the action distribution ρ, following the randomized policy σ yields transition
kernels for the surplus Markov chain and the stationary surplus distribution ζ_ρ (with each
transition kernel having a unique stationary distribution); and (iii) Given the
stationary surplus distribution ζ_ρ, applying the randomized policy σ(x) at each surplus
x yields the distribution of actions ρ. Define the best response mapping Π* that maps
Γ_ρ ⊗ Γ_σ^{|X|} ⊗ Γ_ζ into itself. Then we say that the assumed action distribution ρ,
randomized policy σ and stationary surplus distribution ζ_ρ constitute a mean field
equilibrium (MFE) if Π* : ρ ⊗ σ ⊗ ζ_ρ ↦ ρ ⊗ σ ⊗ ζ_ρ has (ρ, σ, ζ_ρ) as a fixed point.

5.1. Existence of MFE 

Theorem 5.2. There exists an MFE of ρ, the randomized policy σ(x) at each surplus
x and ζ, such that ρ ∈ Γ_ρ, σ(x) ∈ Γ_σ and ζ ∈ Γ_ζ, ∀a ∈ A and ∀x ∈ X.

We will be specializing to the spaces Γ_ρ, Γ_σ and Γ_ζ, and we first define the topologies
being used in the following proofs.

(1) For the assumed action distribution ρ ∈ Γ_ρ on the finite set A, all norms are
equivalent; we will consider the topology of uniform convergence, i.e., using the l_∞ norm
given by ‖ρ‖ = max_{a∈A} ρ(a).

(2) For the randomized policy σ ∈ Γ_σ^{|X|}, we enumerate the elements in X as 1, 2, …,
and consider the topology induced by a norm ‖σ‖ built from the per-state magnitudes
|σ(x)| = max_{a∈A} σ(x, a). We consider any sequence {σ_n}_{n=1}^∞ that converges to σ in
this topology.

(3) For the surplus distribution ζ on the countable set X, we consider the topology of
pointwise convergence, which can be shown to be equivalent to convergence in l_∞,
i.e., uniform convergence, using coupling results presented in [Li et al. 2016b].


Note that from the definition of Γ_ρ, Γ_σ and Γ_ζ, they are already non-empty, convex
and compact. Furthermore, they are jointly convex. Then in order to show that the
mapping Π* satisfies the conditions of the Kakutani fixed point theorem, we only need to
verify the following three lemmas.

Lemma 5.3. Given ρ, by taking the best response given by (8), we can obtain the
action distribution σ(x) for every x, which is upper semicontinuous in ρ.

Lemma 5.4. Given ρ and σ(x), there exists a unique stationary surplus distribution
ζ(x), which is continuous in ρ and σ(x).

Lemma 5.5. Given ζ(x) and σ(x), there exists a stationary action distribution ρ,
which is continuous in ζ(x) and σ(x).

6. CHARACTERISTICS OF THE BEST RESPONSE POLICY 

In this section, we characterize the best response policy under the assumption that
V_ρ in (8) has some properties. Then we discuss the relations between the incremental
utility function u(x) and the optimal value function V_ρ.

6.1. Existence of Threshold Policy 

We assume that, given the action distribution ρ, V_ρ(x) is increasing and supermodular
in x when x < −l; increasing and linear in x when −l < x < w; and increasing and
submodular in x when x > w.

In Section 3, our lotteries were constructed such that the probability of winning
monotonically increases with the cost of the action. This, when combined with the
monotonicity, submodularity (decreasing differences) for positive argument and
supermodularity (increasing differences) for negative argument of V_ρ, yields the following
characterization of the best response policy.

Lemma 6.1. For any two actions, say actions a_1 and a_2, suppose that θ_{a_1} > θ_{a_2}, so
that p_{ρ,a_1} > p_{ρ,a_2} and hence φ(p_{ρ,a_1}) > φ(p_{ρ,a_2}). Then there is a threshold value of
the surplus queue for the user such that the preference order for the actions changes from
one side of the threshold to the other.

Similarly, if we simply assume that V_ρ(x) is increasing and submodular in x ∈
(−∞, ∞), or increasing and supermodular in x ∈ (−∞, ∞), it will also yield the
existence of a threshold policy.


6.2. Relations between incremental utility function u(x) and the optimal value function V_ρ

6.2.1. S-shaped prospect incremental utility function. Consider the S-shaped prospect
incremental utility function u(x), which satisfies the following conditions: (i) u(x) is a
concave increasing function of x when x > w; (ii) u(x) is a linear increasing function of
x when −l < x < w; (iii) u(x) is a convex increasing function of x when x < −l; and (iv)
u(x) is continuous on (−∞, ∞).

From the definition of the Bellman operator T_ρ, it is clear that it takes concave
functions into concave functions. Since the value function is the limit of the sequence of
iterates of this operator, all we would need to prove is that the limit of a sequence of
concave functions is concave, provided we start our iterations with a concave function.
Concave functions are defined by a weak inequality, so the set defining them must be
closed, and the value function would then be concave. However, in our case, that set is not
a closed set given x > −l. It is only closed for x > 0.

Therefore, we cannot prove that V_ρ(·) is increasing, convex in (−∞, −l], linear in
[−l, w] and concave in [w, ∞), but intuitively V_ρ(·) should also be convex, linear and
then concave as x increases from −∞ to ∞. Indeed, we observed this property in the
numerical studies in Section 7.

In other words, we conjecture that the optimal value function is supermodular,
linear and then submodular as x increases from −∞ to ∞, which will yield an optimal
threshold policy.


6.2.2. Concave/Convex incremental utility function. Now if we simply assume that the
incremental utility function u(x) is concave/convex and monotone increasing in x, then
we can obtain the following useful properties of the optimal value functions, which we
can use to characterize the optimal threshold policy.

Next, we show that the value function V_ρ(x) is increasing and submodular (i.e.,
decreasing differences) in x if u(x) is concave in x, and increasing and supermodular
(i.e., increasing differences) in x if u(x) is convex in x.

Lemma 6.2. Given the distribution of actions ρ, V_ρ(x) is an increasing and
submodular function of x if u(x) is a concave function of x, and an increasing and
supermodular function of x if u(x) is a convex function of x.


7. NUMERICAL STUDY 

We now conduct an empirical data-based simulation in the context of electricity usage
for home air conditioning to illustrate the likely performance of our nudge system in
the context of electricity demand-response. As mentioned earlier, our context
is that of a Load Serving Entity (LSE) trying to incentivize its customers to shape
their electricity consumption for air conditioning so as to reduce its cost of electricity
purchase from the wholesale market. These incentives could increase the net surplus
of the end-users, the profit of the LSE, or the total welfare of these agents as well. We
will explore a benchmark scheme that simply provides a fixed return to customers, as
well as a lottery-based scheme that aggregates savings to provide a single
larger reward.

In order to conduct a realistic simulation, we need good estimates of electricity prices
and home AC electricity usage. Historical electricity prices are available directly from
ERCOT [ERCOT 2014]. Since AC usage depends on the ambient temperature and the
home AC installation, insulation etc., we first obtain a data set from [Pecan-Street
2014] containing the ambient temperatures over the course of each day in June-
August, 2013, which form our candidate months. We then use a standard home model
with a nominal thermostat setting of 22.5°C that experiences the ambient temperatures
over each day of the data set to determine the electricity usage at different times
of day. The thermostat setpoint can be adjusted to modify electricity usage based on
the incentives provided. We consider a cluster of 50 such homes, and we determine both
the surplus of the end-users and the savings of the LSE as compared to the baseline in
which the homes do not modify their electricity usage patterns.

While we present the case of homogeneous homes that all have identical parameters
and an identical baseline, it is straightforward to extend our results to the case
where there is a finite set of classes of homes, each with its own baseline, and the
participating homes are drawn randomly from these classes.

7.1. Home Model


A standard continuous time model [Callaway 2009; Hao et al. 2015] for describing the
evolution of the internal temperature T(t) at time t of an air conditioned home is

\[
\dot{T}(t) =
\begin{cases}
-\dfrac{1}{RC}\,(T(t) - T_a) - \dfrac{\eta}{C}\, P_m, & \text{if } q(t) = 1, \\[2mm]
-\dfrac{1}{RC}\,(T(t) - T_a), & \text{if } q(t) = 0.
\end{cases} \tag{13}
\]

Here, T_a is the ambient temperature (of the external environment), R is the thermal
resistance of the home, C is the thermal capacitance of the home, η is the efficiency,
and P_m is the rated electrical power of the AC unit. The state of the AC is described
by the binary signal q(t), where q(t) = 1 means the AC is in the ON state at time t and
q(t) = 0 means it is in the OFF state. The state is determined by the crossings of
user-specified temperature thresholds as follows:

\[
\lim_{\epsilon \to 0} q(t + \epsilon) =
\begin{cases}
q(t), & |T(t) - T_r| < \Delta, \\
1, & T(t) > T_r + \Delta, \\
0, & T(t) < T_r - \Delta,
\end{cases} \tag{14}
\]

where T_r is the temperature setpoint and Δ is the temperature deadband.


Table I. Parameters for a Residential AC Unit

Parameter                          Value
C, Thermal Capacitance             10 kWh/°C
R, Thermal Resistance              2 °C/kW
P_m, Rated Electrical Power        6.8 kW
η, Coefficient of Performance      2.5
T_r, Temperature Setpoint          22.5 °C
Δ, Temperature Deadband            0.3 °C

A number of studies investigate the thermal properties of typical homes. We use the
parameters shown in Table I for our simulations. These are based on the derivations
presented in [Callaway 2009] for temperature conditioning a 250 m² home (about 2700
square feet), which is a common mid-size home in many Texas neighborhoods.
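
To make the home model concrete, the following Python sketch integrates the temperature
dynamics with a forward Euler step and applies the deadband thermostat rule, using the
parameter values of Table I as defaults. It is an illustrative sketch rather than the
simulator used for the results reported here: the function name `simulate_home`, the
step size, the initial condition (interior temperature at the setpoint with the AC off)
and the constant ambient temperature in the usage note are all assumptions.

```python
def simulate_home(ambient_temps, dt_hours, setpoint=22.5, deadband=0.3,
                  R=2.0, C=10.0, eta=2.5, P_m=6.8):
    """Forward-Euler simulation of the thermostat-controlled home.

    ambient_temps: ambient temperature (deg C) for each time step.
    dt_hours: step length in hours (R*C is in hours with Table I units).
    Returns interior temperatures, ON/OFF flags and electrical energy (kWh).
    """
    T, q = setpoint, 0  # assumed initial condition
    temps, on_flags = [], []
    energy_kwh = 0.0
    for T_a in ambient_temps:
        # Thermostat rule: switch only when the deadband is crossed.
        if T > setpoint + deadband:
            q = 1
        elif T < setpoint - deadband:
            q = 0
        # Temperature dynamics: heat leakage plus cooling when the AC is ON.
        dT = -(T - T_a) / (R * C) - (eta * P_m / C if q else 0.0)
        T += dT * dt_hours
        energy_kwh += q * P_m * dt_hours
        temps.append(T)
        on_flags.append(q)
    return temps, on_flags, energy_kwh
```

For example, `simulate_home([35.0] * (24 * 60), dt_hours=1 / 60)` simulates one day at a
constant 35°C with one-minute steps; the interior temperature then cycles within the
deadband around the setpoint while the AC switches ON and OFF.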

In order to determine the energy usage for AC in our typical home, we need to know
how the ambient temperature varies in Texas during the summer months of interest.
These values are available in the Pecan Street data set, and we plot the values of 3
days, arbitrarily chosen over the three months for Austin, TX, in Figure 3.
Next, we calculate the ON-OFF pattern of our typical air conditioner based on the
ambient temperature variation over the course of the day. We do this by simulating
the controller in (14) with the appropriate ambient temperature values taken from
Figure 3. The pattern is presented in Figure 4. We see that there is higher energy
usage during the hotter times of the day, as is to be expected. This also corresponds
to the peak in wholesale electricity prices shown for the same period in Figure 1. The
total energy used each day corresponding to our 5 ton AC (= 6.8 kW; see Table I) is
32.83 kWh.


For comparison, we use the Pecan Street data set to provide the measured daily
average energy usage for AC during the same period for 4 homes that have parameters
in the same ballpark as our typical home. These values are shown in Table II. The table
shows the close match of our home model with real AC usage patterns.


7.2. Actions, Costs and LSE Savings 

The customer action space in our problem consists of choosing when to turn the AC ON
and OFF, and is uncountably infinite. We need to pick a reasonable discrete
subset of the action space for our study. From Figure 1, it is clear that the consumption
during the maximum price period, 5-6 PM, has the maximum impact on the overall
energy cost of the LSE. The LSE would like to incentivize the shift of some of this
usage, without excessively affecting the internal home temperature. We assume that
the actions available to the customer involve setting different setpoints during each
hour from 2-8 PM, i.e., 6 periods (hours) in total. We denote each period by an index
j, where j = 1 indicates the period 2-3 PM and so on until j = 6, which indicates the
period 7-8 PM. Each action can now be identified with a vector (y_1, y_2, y_3, y_4, y_5, y_6),
where y_j indicates the setpoint in period j. We take the setpoint 22.5°C as the baseline.
Hence, the vector (22.5, 22.5, 22.5, 22.5, 22.5, 22.5) indicates a baseline action
in which the customer does not change the original setpoint in any period. The set of
all such setpoint vectors defines an action set A, and we define the action with index
a = 0 to be the no-change action. Since the action set is uncountably infinite, we
identified 5 other actions that appeared to have the most promise of being used. These
actions are shown in the second column of Table III. The identification of the best set of
five actions (or a different number depending on the complexity allowed) is left for future
work, where ideas from reinforcement learning and continuum bandits can be applied.

[Figure: Ambient Temperature]
Fig. 3. Ambient temperature of 3 arbitrary days from June-August, 2013 in Austin, TX.
Measurements are taken every 15 minutes from 12 AM to 12 PM.

[Figure: ON-OFF Pattern for AC]
Fig. 4. Simulated ON/OFF state of AC over a 24 hour period in a home and the corresponding
interior temperature. The interior temperature falls when the AC comes on, and rises when
it is off.

Table II. Daily AC Usage for Four Homes

ID      Square feet   Age   AC (tons)   Energy (kWh)
93      2934          20    5           28.2
545     2345          6     3.5         41.5
4767    2710          5     4           31.7
3967    2521          5     3.5         37.8

Our next step is to calculate the cost of taking each action a ∈ A, which corresponds
to the discomfort of having a potentially higher mean and standard deviation in the
home temperature, and higher energy consumption due to that action. We measure the
state of the home under action a ∈ A by the tuple consisting of the mean temperature,
the standard deviation and the energy usage, denoted [T̄_a, σ_a, E_a]. The baseline state
of these parameters is under action 0, denoted by [T̄_0, σ_0, E_0]. Then we define the cost
of taking any action a as

\[
\theta_a = |\bar{T}_0 - \bar{T}_a| + \lambda\, |\sigma_0 - \sigma_a| - \iota\, (E_0 - E_a), \tag{15}
\]

where we choose λ = 10 to make the numerical values of the mean and standard
deviation comparable to each other, and ι = 10 ¢/kWh as the fixed energy price. We note
that the map between temperature variation, discomfort suffered, and its measurement
in cents is not obvious. However, given the fact that the customer uses between
1-3 kWh, or about ¢10-30 per hour, to obtain a temperature differential between the
ambient temperature and interior temperature of about 15-20°C, the discomfort cost
of a degree C temperature increase being ¢1 seems reasonable in the limited
temperature range that we are interested in. Note that the calculation of the cost for each
action involves simulating the home under that action to determine [T̄_a, σ_a, E_a].
However, this has to be done only once to create a look-up table, which can be used
thereafter. Note also that each action in A is chosen to be close to energy neutral, i.e.,
the third term in (15) is essentially zero. Thus, we focus on modifying usage time, not the
total usage. Table III shows our selection of actions and their corresponding costs. In
future work we will explore other cost metrics, perhaps different risk measures.
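
The cost definition amounts to a one-line computation once the simulated statistics of
each action are available. The Python sketch below shows it under the reading that the
cost is the absolute change in mean temperature, plus a weight of 10 on the absolute
change in standard deviation, minus the energy price of 10 ¢/kWh times the energy saved.
The function name `action_cost` and the numbers in the usage note are hypothetical and
are not taken from the data set.

```python
def action_cost(baseline, action, std_weight=10.0, energy_price=10.0):
    """Discomfort-plus-energy cost (in cents) of an action relative to baseline.

    baseline and action are (mean temperature, standard deviation, energy in kWh)
    tuples obtained from simulating the home. std_weight scales the standard
    deviation term and energy_price is the fixed price in cents per kWh.
    """
    T0, s0, E0 = baseline
    Ta, sa, Ea = action
    # Using less energy than the baseline (E0 > Ea) lowers the cost.
    return abs(T0 - Ta) + std_weight * abs(s0 - sa) - energy_price * (E0 - Ea)
```

For instance, a hypothetical energy-neutral action that raises the mean temperature by
0.3°C and the standard deviation by 0.05°C would cost 0.3 + 10 × 0.05 = 0.8 cents.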

When applied over a day, each action could result in some savings to the LSE towards
the costs it incurs in purchasing electricity. We measure the day-ahead price of
electricity experienced by the LSE in dollars/MWh and denote the price at time period
j in day i as π_{ij}, where i ∈ {1, 2, …, 92} and j ∈ {1, …, 6}. Each action vector of a
customer would impose a net price on the LSE in proportion to the usage. We define
the differential price measured in dollars imposed by an action y = (y_1, y_2, y_3, y_4, y_5, y_6)
versus z = (z_1, z_2, z_3, z_4, z_5, z_6) as




(16) 


where k converts the setpoints into electricity usage in each period, which is measured 
in MWh. Setting y as the baseline action (22.5,22.5,22.5,22.5,22.5,22.5) presents a way 
of measuring the reduction/increase in cost due to the incentive scheme. 

We calculated the savings of each action applied over each day of our three month
data set and obtained the average savings. These values are also shown in Table III.
As is clear, the cost of taking each of our selected actions is considerably lower than the
savings resulting from that action, and hence it might be possible to create appropriate
incentive schemes to encourage their adoption. We will consider two such schemes,
namely, (i) a fixed reward scheme used as a benchmark, and (ii) a lottery based scheme.


Table III. Actions, Costs and LSE Savings

Index   Action Vector                              Cost (¢)   LSE Savings (¢)
0       (22.5, 22.5, 22.5, 22.5, 22.5, 22.5)       0          0
1       (21.5, 21.5, 22.25, 23.5, 23.75, 21.25)    3.68       27.7
2       (21.5, 21.5, 22.25, 23.5, 23.25, 22.25)    3.15       22.7
3       (21.5, 21.5, 22.25, 24, 23, 22.5)          2.68       22
4       (22, 22, 22.25, 23, 23, 22.5)              1.34       19
5       (22, 22, 22.25, 23.25, 22.5, 22.75)        0.95       16.4


7.3. Benchmark Incentive Scheme 

A simple incentive scheme to get users to adopt cost saving actions (from the LSE’s
perspective) is to calculate the expected savings of each action, and to deterministically
reward each agent with some percentage of the expected savings for taking that action.
Such guaranteed savings are similar in spirit to rebates for using public transportation
during off-peak hours, and to a system of sharing a fraction of the savings by demand-
response providers such as OhmConnect [OhmConnect 2015]. We will use this scheme
as a benchmark in order to determine whether shared savings are large enough to
encourage meaningful participation. Thus, our benchmark incentive scheme attempts
to incentivize each action by returning to the user some fixed fraction of the expected
LSE savings for that action presented in Table III. For example, a return of 50% for
taking action 1 would imply awarding ¢13.85 each time that action is taken.

Table IV. Mean Day-ahead Price and Energy Coupons

Index   Period   Mean Day-ahead Price/MWh   Coupons/kWh
1       2-3 PM   $47                        107 if x_1 > 2.464; 1.8 otherwise
2       3-4 PM   $55                        5.4
3       4-5 PM   $78                        1.8
4       5-6 PM   $99.6                      0
5       6-7 PM   $66.5                      3.6
6       7-8 PM   $49.5                      54 if x_6 > 2.24; 1.8 otherwise

7.4. Lottery-Based Incentive Scheme 

Our second incentive scheme is lottery-based, with Energy Coupons being used as
lottery tickets. Now, the baseline action a = 0 corresponds to a setpoint of 22.5°C in period
4, at which π_{i,4} is highest (Figure 1) for any day i. Hence, the LSE should incentivize
actions that are likely to reduce the risks of peak day-ahead price borne by the LSE by
offering Energy Coupons in proportion to the usage during the corresponding periods.
The problem of optimally selecting these coupons is a hard one in general. However, in
the limited context of our simulation, it is intuitively clear that coupons must be placed
at periods of lower price. Our coupon choices are shown in Table IV together with the
mean day-ahead price, where x_1 and x_6 are the energy usage in the corresponding
periods (measured in kWh). Note that the numbers of coupons are not necessarily integers,
although making them integer quantities will not have any impact on our results.

Given the coupon placement by the LSE, the customers need to determine the number
of coupons resulting from each action, and use these values to estimate the utility
that they would attain. Our six actions are shown in Table V with their attendant
costs and number of coupons received. The LSE conducts a lottery each week across
clusters of M = 50 homes participating in each lottery. For each cluster, there is N = 1
prize for winning the lottery. We assume that the customers choose the same action on
each day of the week, and then participate in the lottery.
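
The coupon schedule of Table IV can be applied to a usage profile as in the following
Python sketch. It assumes that the coupons earned in a period equal the per-kWh rate of
that period times the energy used in it; the function names `coupon_rate` and
`total_coupons`, as well as the usage values in the note below, are hypothetical.

```python
def coupon_rate(period, usage_kwh):
    """Coupons per kWh in a period (1..6), following the schedule of Table IV."""
    if period == 1:
        return 107.0 if usage_kwh > 2.464 else 1.8
    if period == 6:
        return 54.0 if usage_kwh > 2.24 else 1.8
    return {2: 5.4, 3: 1.8, 4: 0.0, 5: 3.6}[period]


def total_coupons(usage_by_period):
    """Total coupons for a list of six per-period usages (kWh), periods 1..6."""
    return sum(coupon_rate(j, x) * x
               for j, x in enumerate(usage_by_period, start=1))
```

For a made-up profile of 2 kWh in every period, no threshold is crossed and the total is
1.8·2 + 5.4·2 + 1.8·2 + 0 + 3.6·2 + 1.8·2 = 28.8 coupons, whereas shifting usage into
periods 1 and 6 beyond their thresholds raises the total sharply.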


Table V. Actions, Costs and Energy Coupons

Index   Action Vector                              Cost (¢)   Coupons
0       (22.5, 22.5, 22.5, 22.5, 22.5, 22.5)       0          37.4
1       (21.5, 21.5, 22.25, 23.5, 23.75, 21.25)    3.68       715
2       (21.5, 21.5, 22.25, 23.5, 23.25, 22.25)    3.15       693
3       (21.5, 21.5, 22.25, 24, 23, 22.5)          2.68       577
4       (22, 22, 22.25, 23, 23, 22.5)              1.34       434
5       (22, 22, 22.25, 23.25, 22.5, 22.75)        0.95       222


7.5. Utility and Surplus 

As described earlier, the user state consists of his/her surplus. Any reward results
in an increase in surplus by the reward amount w, whereas performing an action but
not receiving a reward results in decreasing the surplus by some amount l. Since a
reward is assured for each action in the benchmark incentive scheme, there are no
surplus decrease events. However, in the lottery scheme, a user that does not win the
lottery would see a decrease in surplus. We select the l that results from an action a to
be the same value as the reward corresponding to action a in the benchmark scheme
that has the same total reward. In other words, the loss of surplus due to losing at the
lottery whose prize is $15 is chosen to be the same as the gain that would have resulted
from the same action in the fixed reward benchmark scheme where the total amount
disbursed is $15. This choice allows for a fair comparison between the two schemes to
determine if the user is better off in one scheme versus the other.

For the customer utility, which maps surplus to utility units, we use the value function
of the prospect model defined earlier, \(u(x) = x^{\gamma}\) if \(x \ge 0\) and
\(u(x) = -\psi(-x)^{\gamma}\) if \(x < 0\), where \(\psi = 2.25\) and \(\gamma = 0.88\)
according to the empirical studies in [Tversky and Kahneman 1992; Kahneman and Tversky
1984]. The utility model applies to both the benchmark and the lottery scheme. Under this
model, we expect a user who has lost a number of lotteries to stop participating in the
system, since his surplus becomes negative and he is not receiving enough of an incentive
to stay, given the cost he bears each day. Similarly, a user who has won too many times
would have a large surplus, and would also not be keen on participating since the
marginal utility he gets may not be high enough for him. The latter observation applies
to the benchmark as well, although given the small rewards, we do not expect it to happen
frequently.
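
To make the shape of this utility concrete, the following is a minimal sketch of the value function with the quoted parameters (loss-aversion coefficient 2.25, exponent 0.88). The function name `prospect_utility` and its argument names are illustrative choices and not notation from the paper.

```python
def prospect_utility(x, gamma=0.88, loss_aversion=2.25):
    """Prospect-theory value function: x**gamma for gains,
    -loss_aversion * (-x)**gamma for losses."""
    if x >= 0:
        return x ** gamma
    return -loss_aversion * (-x) ** gamma


# Compare a gain and a loss of the same magnitude.
gain = prospect_utility(5.0)
loss = prospect_utility(-5.0)
print(gain, loss)
```

With these parameters a loss is weighted 2.25 times as heavily as a gain of the same size, which is why a run of lost lotteries can quickly make participation unattractive.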

The participants in the lottery scheme see a distorted probability of winning, governed
by the weighting parameter defined earlier. This is an important feature of our model,
since it captures the attractiveness of lotteries in incentivizing risky actions.
Consistent with empirical studies in [Prelec 1998], we set this parameter to 0.37.

We assume that customers remain in the system with probability 0.92 at each time step,
i.e., the average customer participates for about 12 time steps, which roughly parallels
the fact that the main summer season lasts for about three months in Texas. Further, a
newly entering customer starts with an integer surplus of 0.
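
As a quick check of the lifetime figure, the number of time steps a customer participates is geometrically distributed when the customer stays with probability 0.92 per step. The symbol \(T\) for the lifetime is introduced here only for illustration, and counting the first step is an assumption.

```latex
\[
  \mathbb{E}[T] \;=\; \sum_{k=1}^{\infty} k\,(1-0.92)\,0.92^{\,k-1}
  \;=\; \frac{1}{1-0.92} \;=\; 12.5 ,
\]
```

This is consistent with the roughly 12 time steps quoted above.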


7.6. Equilibria Attained by Incentive Schemes

Benchmark Scheme. As described in Section 7.3, we construct a fixed-reward incentive
scheme to obtain a benchmark against which to compare the performance of the lottery
scheme. Under the benchmark scheme, customers are awarded some percentage of the expected
savings that their action is likely to yield to the LSE, as tabulated earlier. The actual
action chosen by the customer is determined using a dynamic program (DP) similar to the
one defined earlier, based on their surplus (state). However, since rewards are
deterministic, there is no dependence on the belief over what the other users will do,
and the only randomness is from the lifetime of the user. Thus, solving the DP is
straightforward, and we can easily obtain a map between surplus and action for a given
reward. For illustration, if we assume that $15 is given as the total reward (which
corresponds to the LSE returning about 22% of its savings back to the customers), it is
easy to calculate that the resulting action distribution is [0, 0, 0, 0, 0, 1], i.e.,
action 5 is taken by all customers.


Lottery Scheme. We next determine the properties of the MFE generated by our lottery
system. We start with a uniform action distribution as the initial condition. For
illustration, we take a prize value of $15, which implies that the customer expects to
win at least 30 cents on average by participating, i.e., the decrease in surplus due to
losing at the lottery is \(l = 0.3\), while the increase in surplus due to winning is
\(w = 15 - 0.3 = 14.7\). We run the system over 65 iterations, determining the steady
state action distribution at each step and using that as the input for the next
iteration. We find that convergence is rapid: the distribution is within 0.1% of its
final value within 20 iterations.
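
The iteration described above can be sketched as a plain fixed-point loop. Everything model-specific is a placeholder here: `toy_response` is a hypothetical stand-in for "solve the DP and compute the steady-state action distribution", and its target vector is chosen only so the loop has something to converge to. Reading the 30 cents as the $15 prize divided across a cluster of 50 customers is also an assumption.

```python
def iterate_to_fixed_point(response, rho0, max_iters=65, tol=1e-3):
    """Feed the action distribution back into `response` until it stops changing."""
    rho = list(rho0)
    for iteration in range(1, max_iters + 1):
        new_rho = response(rho)
        change = max(abs(n - o) for n, o in zip(new_rho, rho))
        rho = new_rho
        if change <= tol:
            return rho, iteration
    return rho, max_iters


# Surplus changes implied by a $15 prize in a cluster of 50 customers (assumed reading).
prize, cluster_size = 15.0, 50
l = prize / cluster_size   # loss in surplus when the lottery is lost
w = prize - l              # gain in surplus when the lottery is won

# Hypothetical response map: pulls any distribution halfway toward a fixed target.
target = [0.0, 0.0, 0.81, 0.0, 0.19, 0.0]


def toy_response(rho):
    return [0.5 * r + 0.5 * t for r, t in zip(rho, target)]


uniform = [1.0 / 6] * 6
rho_star, iterations = iterate_to_fixed_point(toy_response, uniform)
print(l, w, iterations, rho_star)
```

In an actual computation the response map would re-solve the customer's dynamic program for the current distribution; the loop structure and stopping rule stay the same.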

We determine the value of each state by value iteration over the Bellman equation for
the value function, keeping the action distribution of other players fixed. Value
iteration converges within about 50 steps for each state. The surplus distribution to
which the system eventually converges is the mean field surplus distribution, which is
shown in Figure 5 (Left). It indicates that customers win at a lottery at most once over
an average lifetime of 12 time intervals, as is to be expected with a cluster size of 50
customers at each lottery. The mean field action distribution under the lottery scheme
with a $15 reward is [0, 0, 0.81, 0, 0.19, 0], i.e., action 2, which is the best action
from the LSE's perspective (see the table of actions), is chosen with probability 0.81.
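
A minimal sketch of value iteration for a single customer follows. It is not the paper's implementation: the integer surplus steps `w` and `l`, the truncated surplus grid, the cost vector `theta` and the win probabilities `p_win` are all hypothetical, and holding `p_win` fixed stands in for keeping the action distribution of the other players fixed.

```python
def prospect_utility(x, gamma=0.88, loss_aversion=2.25):
    return x ** gamma if x >= 0 else -loss_aversion * (-x) ** gamma


def value_iteration(theta, p_win, w=3, l=1, beta=0.92,
                    x_min=-30, x_max=60, sweeps=50):
    """theta[a]: cost of action a; p_win[a]: fixed win probability of action a."""
    states = range(x_min, x_max + 1)

    def clamp(x):
        # Keep the surplus on the truncated grid (an approximation).
        return min(max(x, x_min), x_max)

    V = {x: 0.0 for x in states}
    for _ in range(sweeps):
        # Bellman update: utility of the current surplus plus the best
        # discounted continuation value over all actions.
        V = {
            x: prospect_utility(x) + max(
                -theta[a] + beta * (p_win[a] * V[clamp(x + w)]
                                    + (1 - p_win[a]) * V[clamp(x - l)])
                for a in range(len(theta))
            )
            for x in states
        }
    return V


V = value_iteration(theta=[0.0, 0.5, 1.0], p_win=[0.0, 0.05, 0.1])
print(V[0], V[10])
```

Starting from the zero function, each sweep preserves monotonicity in the surplus, so the resulting values increase with the surplus.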

Example. Figure 5 (Middle) shows the interior temperature under actions 0, 2, 4, the
mean field action distribution and the benchmark action when $15 is the total reward
amount. We see that the mean field behavior is more aggressive than the benchmark in
reducing the interior temperature before the peak period, and shows a marginally higher
interior temperature during the peak period. Figure 5 (Right) shows the comparison of
energy consumption between action 0 (doing nothing, with an average energy consumption
of 36.5 kWh per day), the mean field action distribution (average energy consumption of
36.7 kWh per day), and the benchmark action (average energy consumption of 36.4 kWh per
day). We see that the mean field distribution is more aggressive in moving energy usage
away from the peak period as compared to the benchmark, although both have essentially
the same energy consumption and an identical reward value of $15 per week. Finally, we
compute that the savings to the LSE over 50 homes each week in this example is $57.4 in
the benchmark scheme and $77 in the lottery scheme. Thus, incentivizing customers by
offering a prize of $15 each week is certainly feasible. The MFE illustrates that even a
change as small as 1°C in the AC setpoint each day over several homes can yield
significant benefits.

Fig. 5. (Left) Mean Field Distribution of Surplus; (Middle) Simulated ON/OFF state of AC over a 24 hour
period under actions 0, 2, 4, the mean field action and the benchmark action on an arbitrary day and the
corresponding interior temperature. The temperature graph is slightly offset for actions 2, 4, the mean field
action and the benchmark action for ease of visualization; (Right) Average daily energy usage profile.


7.7. Performance Analysis of Incentive Schemes 

Benchmark Scheme. We consider a range of scenarios wherein the LSE rewards customers for
each action with between 1% and 100% of its expected savings, in increments of 1%. The
relations between the total weekly reward to customers, savings to the LSE and profit to
the LSE are shown in Figure 6 (Left). We see that the maximum weekly profit of $52 is
achieved when about 9% of the savings (about $10 in total per week) is given to the
customers as reward. Note that although the reward under the benchmark scheme is
indicated by a percentage returned, it corresponds to a dollar value returned based on
the actions of the customers, and the total dollar reward values are also shown in green
(dashed line) in Figure 6 (Left).





Fig. 6. The relation between offered reward, LSE savings and LSE profit: (Left) Benchmark incentive 
scheme and (Middle) Lottery scheme; (Right) The relation between profit to the LSE and the expected value 
of a generic customer when different rewards are given to the customers. 

Lottery Scheme. We next conduct numerical experiments under a range of lottery rewards
from $5 to $95 in increments of $5, as we did (by using percentage returns) with the
benchmark scheme. Hence, we set a reward value, calculate the values of \(l\) and \(w\)
that it implies, and compute the mean field equilibrium as we did in the previous
subsection. Our results are shown in Figure 6 (Middle), where we plot the total savings
to the LSE as well as its profit (savings minus reward) as a function of the reward
offered for winning the lottery. From Figure 6 (Middle), the maximum profit is achieved
by giving a reward of $15. Also, from the mean field action distribution that results
from this reward, described in detail in the example used in Section 7.6, we note that
almost all the customers will participate in the system with this reward, i.e., the
probability of choosing action 0 is 0. The break-even point is at a reward of about $80.

From Figure 6 (Left) and (Middle), we see that the maximum profit using the lottery
scheme is about $62 per week, while the maximum profit is only about $52 under the
benchmark scheme. Given that a typical LSE has several hundred thousand customers, a
difference of $10 each week over a cluster of 50 homes is quite significant.

Comparison of Local Social Welfare. Our final step is to characterize the local social
welfare under the lottery and the benchmark, respectively. We define the (expected) local
social welfare (LSW), measured in dollars/week, as

\[ \mathrm{LSW} = 50\,\mathbb{E}_{\mathrm{MFE}}[V(X)] + \text{profits of the LSE}. \tag{17} \]

For the lottery scheme, the maximum expected value to each user, roughly $600, is
obtained when the LSE is revenue neutral. This is roughly 100% better than what can be
achieved with the benchmark individual incentive. This increase is due to both the
prospect-based utility and the coupling across users in the lottery scheme.

Note also that the LSE obtains a maximum profit of $62 by giving a $15 reward, under
which each customer will take actions 2 and 4 with probabilities 0.81 and 0.19,
respectively, from the mean field action distribution described in Section 7.6. The
corresponding surplus distribution is shown in Figure 5 (Left), and the expected value of
a generic customer is about $189.8. Therefore, the local social welfare is about $9552.

For the benchmark, the profit to the LSE is $42 when a $15 reward is given, as shown in
Figure 6 (Left), under which each customer will only take action 5. We omit the plot for
the surplus distribution due to space constraints. The corresponding expected value of a
generic customer is about $56.2, and the local social welfare is about $2852. Again, we
see that the lottery scheme outperforms the benchmark.
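
The two welfare figures can be reproduced directly from the definition of LSW (cluster size times expected customer value, plus LSE profit). The helper name `local_social_welfare` is illustrative only; the numbers are the ones quoted in the text.

```python
def local_social_welfare(customer_value, lse_profit, cluster_size=50):
    """LSW = cluster_size * expected value of a generic customer + LSE profit."""
    return cluster_size * customer_value + lse_profit


lottery_lsw = local_social_welfare(189.8, 62)     # lottery scheme, $15 prize
benchmark_lsw = local_social_welfare(56.2, 42)    # benchmark scheme, $15 total reward
print(round(lottery_lsw), round(benchmark_lsw))
```

Most of the gap comes from the customer-value term rather than from the $20 difference in LSE profit.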

We perform the same analysis to determine the relation between profit to the LSE and the
expected value of a generic customer when different rewards are given. Our results are
shown in Figure 6 (Right). We explore a range of rewards from $5 to the break-even point
($80 for the lottery scheme and $95 for the benchmark). The points on each curve
correspond to increasing the reward in steps of $5, in the direction indicated by the
arrow marks. From Figure 6 (Right), we see that the lottery scheme appears to better
capture the frontier between LSE profit and customer value than the benchmark scheme,
with the lottery-based incentive being better in a Pareto sense. Thus, for a desired
level of customer value and LSE profit, the lottery scheme with an appropriate reward can
always ensure a better outcome than the benchmark scheme.

8. CONCLUSION 

In this paper we developed a general framework for analyzing incentive schemes, referred
to as nudge systems, to promote desirable behavior in societal networks by posing the
problem in the form of a Mean Field Game (MFG). Our incentive scheme took the form of
awarding coupons such that higher cost actions would correspond to more coupons, and
conducting a lottery periodically using these coupons as lottery tickets. Using this
framework, we developed results on the characteristics of the optimal policy and showed
the existence of the MFE.




We used the candidate setting of an LSE trying to promote demand-response, in the form of
setting higher setpoints during the higher price time of day, in order to transfer energy
usage from a higher to a lower price time of day for an air conditioning application. We
conducted data driven simulations that accurately account for electricity prices, ambient
temperature and home air conditioning usage. We showed how the prospect of winning at a
lottery could potentially motivate customers to change their AC usage patterns
sufficiently that the LSE can more than recoup the reward cost through a likely reduced
expenditure in electricity purchase. Further, we showed that a lottery is more effective
than a fixed reward at enabling such desirable behavior and can attain a better tradeoff
between social value and LSE profits.

Our setup is general enough to capture population behavior in other societal networks.
For example, it applies in an essentially unchanged manner to an experiment conducted on
a bus transportation system of an IT firm in India, described in [Merugu et al. 2009].
Here, employees have a choice of an early morning bus that experiences low traffic
congestion or a later one that experiences more. Providing incentives to employees in the
form of lottery tickets for taking the earlier bus was shown to increase its
attractiveness, while simultaneously reducing costs to the firm by running a smaller
number of buses at higher fuel efficiency.

APPENDIX 

A. PROPERTIES OF THE OPTIMAL VALUE FUNCTION 
Proof of Lemma 4.1

We first show that \(T_\rho f \in \Phi\) for all \(f \in \Phi\). The proof then follows
through a verification of the conditions of Theorem 6.10.4 in [Puterman 1994]. From the
definition of \(T_\rho\), we have

\[ |T_\rho f(x)| \le |u(x)| + \max_{a(x)\in\mathcal{A}} \theta_{a(x)} + \beta \max\big(|f(x+w)|,\, |f(x-l)|\big). \]

From this it follows that

\[ \|T_\rho f\|_\Omega \le \sup_{x\in X}\frac{|u(x)|}{\Omega(x)} + \sup_{x\in X}\frac{\max_{a(x)\in\mathcal{A}}\theta_{a(x)}}{\Omega(x)} + \beta \max\Big(\sup_{x\in X}\frac{|f(x+w)|}{\Omega(x)},\ \sup_{x\in X}\frac{|f(x-l)|}{\Omega(x)}\Big). \]

Let \(x_+\) be the unique positive surplus such that \(u(x_+) = 1\) and \(x_-\) be the
unique negative surplus such that \(u(x_-) = -1\). Note that \(\Omega(x)\) is
non-decreasing for \(x > x_-\) and non-increasing for \(x < x_+\). To avoid cumbersome
algebra we will assume \(x_+ - w > 0\) and \(x_- + l > 0\). Since
\(\Omega(x) \ge |u(x)| \ge 0\) and \(\Omega(x) \ge 1\), the first two terms are bounded
by 1 and \(\max_{a(x)\in\mathcal{A}} \theta_{a(x)}\). For the last term we have

\[ \sup_{x\in X}\frac{|f(x+w)|}{\Omega(x)} \le \|f\|_\Omega \sup_{x\in X}\frac{\Omega(x+w)}{\Omega(x)}. \]

We have the following:

\[ \frac{\Omega(x+w)}{\Omega(x)} =
\begin{cases}
\dfrac{u(x+w)}{\max(u(x),1)} = \dfrac{u(x+w)}{u(x)}, & \text{if } x > x_+, \\[2ex]
\dfrac{u(x+w)}{\max(u(x),1)} = \dfrac{u(x+w)}{u(x_+)}, & \text{if } x \in [x_+ - w,\, x_+], \\[2ex]
1, & \text{if } x \in [x_-,\, x_+ - w], \\[1ex]
\dfrac{1}{|u(x)|} \le 1, & \text{if } x \in [x_- - w,\, x_-], \\[2ex]
\left|\dfrac{u(x+w)}{u(x)}\right| \le 1, & \text{if } x < x_- - w.
\end{cases} \]

For \(x > x_+\), we know using monotonicity of \(u(\cdot)\)

\[ \frac{u(x+w)}{u(x)} = 1 + \frac{u(x+w)-u(x)}{w}\,\frac{w}{u(x)} \le 1 + \frac{u(x+w)-u(x)}{w}\, w. \]



























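Additionally, for \(x \in [x_+ - w,\, x_+]\) we have

\[ \frac{u(x+w)}{u(x_+)} = 1 + \frac{u(x+w)-u(x_+)}{x+w-x_+}\,(x+w-x_+) \le 1 + \frac{u(x+w)-u(x_+)}{x+w-x_+}\, w. \]

For the analysis we assume that \(u(\cdot)\) is Lipschitz such that
\(\sup_{x\in X} u'(x) < +\infty\). Therefore, by the mean value theorem

\[ \frac{u(x+w)-u(x)}{w} = u'(\xi_1) \le \sup_{x \ge x_+} u'(x), \qquad \forall \xi_1,\ x \in [x_+, \infty), \]
\[ \frac{u(x+w)-u(x_+)}{x+w-x_+} = u'(\xi_2) \le \sup_{x\in[x_+-w,\,x_+]} u'(x), \qquad \forall \xi_2,\ x \in [x_+ - w,\, x_+], \]
\[ \|f\|_\Omega \sup_{x\in X}\frac{\Omega(x+w)}{\Omega(x)} \le \|f\|_\Omega \Big(1 + w \sup_{x \ge x_+ - w} u'(x)\Big), \qquad \forall x \in [x_+ - w, \infty). \]

Similarly, we have

\[ \sup_{x\in X}\frac{|f(x-l)|}{\Omega(x)} \le \|f\|_\Omega \sup_{x\in X}\frac{\Omega(x-l)}{\Omega(x)}. \]

Now we have the following:

\[ \frac{\Omega(x-l)}{\Omega(x)} =
\begin{cases}
\dfrac{u(x-l)}{\min(u(x),-1)} = \dfrac{u(x-l)}{u(x)}, & \text{if } x < x_-, \\[2ex]
\dfrac{u(x-l)}{\min(u(x),-1)} = \dfrac{u(x-l)}{u(x_-)}, & \text{if } x \in [x_-,\, x_- + l], \\[2ex]
1, & \text{if } x \in [x_- + l,\, x_+], \\[1ex]
\dfrac{1}{u(x)} \le 1, & \text{if } x \in [x_+,\, x_+ + l], \\[2ex]
\dfrac{u(x-l)}{u(x)} \le 1, & \text{if } x > x_+ + l.
\end{cases} \]

Using the same logic as before, we get

\[ \|f\|_\Omega \sup_{x\in X}\frac{\Omega(x-l)}{\Omega(x)} \le \|f\|_\Omega \Big(1 + l \sup_{x\in X:\, x \le x_-} u'(x)\Big). \]

Since \(u(\cdot)\) is Lipschitz, there exists an \(\alpha_0 \in (0, +\infty)\) such that
\(\|T_\rho f\|_\Omega \le \alpha_0\).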

Next, we need to verify the conditions of Theorem 6.10.4 in [Puterman 1994]. The lemma
requires verification of the following three conditions. We set \(x[k]\) to be the state
variable denoting the surplus at time \(k\). We need to show that \(\forall x \in X\),
for some constants (independent of \(\rho\)) \(\alpha_1 > 0\), \(\alpha_2 > 0\) and
\(0 < \alpha_3 < 1\),

\[ \sup_{a(x)\in\mathcal{A}} |u(x) - \theta_{a(x)}| \le \alpha_1 \Omega(x), \tag{18} \]
\[ \mathbb{E}_{x[1],a_0}\big[\Omega(x[1]) \,\big|\, x[0] = x\big] \le \alpha_2 \Omega(x), \qquad \forall a_0 \in \mathcal{A}, \tag{19} \]

with the distribution of \(x[1]\) chosen based on action \(a_0\), and

\[ \beta^J\, \mathbb{E}_{x[J],a_0,a_1,\dots,a_{J-1}}\big[\Omega(x[J]) \,\big|\, x[0] = x\big] \le \alpha_3 \Omega(x), \tag{20} \]

for some \(J > 0\) and all possible action sequences, i.e., \(a_j \in \mathcal{A}\) for
all \(j = 0, 1, \dots, J-1\), with the distribution of \(x[J]\) chosen based on the
action sequence \((a_0, a_1, \dots, a_{J-1})\).

First consider (18). Since \(\Omega(x) = \max(|u(x)|, 1)\), using the earlier analysis,
(18) is true with \(\alpha_1 = 1 + \max_{a\in\mathcal{A}} \theta_a\). Now consider (19).
We have

\[ \mathbb{E}_{x[1],a_0}\big[\Omega(x[1]) \,\big|\, x[0] = x\big] = \phi(p_{\rho,a_0}(x))\,\Omega(x+w) + \phi(1 - p_{\rho,a_0}(x))\,\Omega(x-l) \le \max\big(\Omega(x+w),\, \Omega(x-l)\big), \]

which is bounded by \(\alpha_2 \Omega(x)\) using our analysis from before.

Finally, (20) holds true using the properties of \(\Omega(\cdot)\), the bounds on the
probability of winning and losing established earlier, and our analysis from earlier in
the proof as follows:

\[ \beta^J\, \mathbb{E}_{x[J],a_0,\dots,a_{J-1}}\big[\Omega(x[J]) \,\big|\, x[0] = x\big]
\le \beta^J \max\big(\phi(p_W)^J,\ \phi(1-p_W)^J\big) \max\big(\Omega(x+Jw),\ \Omega(x-Jl)\big)
\le \big(\beta \max(\phi(p_W),\ \phi(1-p_W))\big)^J \alpha_4(J)\,\Omega(x), \]

for some affine \(\alpha_4(J) > 0\) using our analysis from before. It now follows that
taking \(J\) large enough we obtain an \(\alpha_3 < 1\) that is also independent of
\(\rho\). Note that we can get a simpler bound of

\[ \beta^J\, \mathbb{E}_{x[J],a_0,a_1,\dots,a_{J-1}}\big[\Omega(x[J]) \,\big|\, x[0] = x\big] \le \beta^J \alpha_4(J)\,\Omega(x), \]

using just the properties of \(\Omega(\cdot)\). Again we can take \(J\) large enough to
obtain an \(\alpha_3 < 1\) that is independent of \(\rho\). This bound is useful when
there is an action for which the probability of winning or losing is 1. Since all the
conditions of Theorem 6.10.4 of [Puterman 1994] are met, the first result in the lemma
holds true. The second then follows immediately.

Proof of Lemma 4.2

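For any given \(\rho\), from Lemma 4.1 we know that there is a unique \(V_\rho(\cdot)\).
Furthermore, it is the unique fixed point of the operator \(T_\rho\), where \(T_\rho^J\)
is a contraction mapping with constant \(\alpha_3\) that is independent of \(\rho\). From
the definition of \(T_\rho\) it follows that \(T_\rho\) is continuous in \(\rho\):
computing derivatives using the envelope theorem and the expressions derived earlier, it
is easily established that \(T_\rho\) is, in fact, Lipschitz in \(\rho\) (with a constant
that depends on \(M - 1\)) when the uniform norm is used for \(\rho\).

Let \(\rho_1\) and \(\rho_2\) be two population/action profiles such that
\(\|\rho_1 - \rho_2\| \le \epsilon\) (the choice of norm is irrelevant as all are
equivalent for finite dimensional Euclidean spaces). As \(T_\rho\) is continuous in
\(\rho\), there exists a \(\delta > 0\) such that
\(\|T^J_{\rho_1} V_{\rho_2} - T^J_{\rho_2} V_{\rho_2}\|_\Omega \le \delta\). However,
since \(T^J_{\rho_2} V_{\rho_2} = V_{\rho_2}\), we have shown that
\(\|T^J_{\rho_1} V_{\rho_2} - V_{\rho_2}\|_\Omega \le \delta\). Applying
\(T^J_{\rho_1}\) \(n\) times and using the contraction property of \(T^J_{\rho_1}\), we
get

\[ \big\|T^{J(n+1)}_{\rho_1} V_{\rho_2} - T^{Jn}_{\rho_1} V_{\rho_2}\big\|_\Omega \le \alpha_3^{\,n}\, \delta. \]

The proof then follows since
\(\lim_{n\to\infty} \|T^{Jn}_{\rho_1} V_{\rho_2} - V_{\rho_1}\|_\Omega = 0\), so that

\[ \|V_{\rho_1} - V_{\rho_2}\|_\Omega \le \sum_{n=0}^{\infty} \big\|T^{J(n+1)}_{\rho_1} V_{\rho_2} - T^{Jn}_{\rho_1} V_{\rho_2}\big\|_\Omega \le \frac{\delta}{1 - \alpha_3}. \]

Furthermore, using the comment from above we can show that \(V_\rho\) is Lipschitz
continuous in \(\rho\).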


B. THE EXISTENCE AND UNIQUENESS OF STATIONARY SURPLUS DISTRIBUTION 
Proof of Lemma 4.3

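First, from the transition kernel (11), we satisfy the Doeblin condition as

\[ \mathbb{P}\big(x[k] \in B \,\big|\, x[k-1] = x\big) \ge (1-\beta)\,\Psi(B), \]

where \(0 < \beta < 1\), and \(\Psi\) is a probability measure for the regeneration
process. Then from results in [Meyn and Tweedie 2009, Chapter 12], we have a unique
stationary surplus distribution.

Next, let \(-\tau\) be the last time before 0 that the surplus has a regeneration. Then
we have

\[ \zeta_{\rho\times\sigma}(B) = \sum_{k=0}^{\infty} \mathbb{P}(B, \tau = k) = \sum_{k=0}^{\infty} \mathbb{P}(B \mid \tau = k)\cdot \mathbb{P}(\tau = k). \tag{21} \]

Since the regeneration process happens independently of the surplus with
inter-regeneration times geometrically distributed with parameter \((1-\beta)\), then
\(\mathbb{P}(\tau = k) = (1-\beta)\beta^k\). Also given \(\tau = k\), we have
\(X_{-k} \sim \Psi\). Therefore

\begin{align*}
\zeta_{\rho\times\sigma}(B)
 &= \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{P}(B \mid \tau = k)
  = \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}\Big(\mathbb{E}\big(\mathbf{1}_{x[0]\in B} \,\big|\, \tau = k,\ X_{-k} = X\big) \,\Big|\, \tau = k\Big) \\
 &= \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}\Big(\zeta^{(k)}_{\rho\times\sigma}(B \mid X) \,\Big|\, \tau = k\Big)
  = \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}_{\Psi}\Big(\zeta^{(k)}_{\rho\times\sigma}(B \mid X)\Big) \\
 &= \sum_{k=0}^{\infty} (1-\beta)\beta^k \int \zeta^{(k)}_{\rho\times\sigma}(B \mid x)\, d\Psi(x). \tag{22}
\end{align*}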
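
The series representation can be checked numerically on a small example. The sketch below builds the weighted sum of the distributions k steps after a regeneration and confirms that it is invariant under one step of the full dynamics (regenerate with probability 1 - beta, otherwise move). The 3-state transition matrix `P`, the entry distribution `psi` and the truncation at 400 terms are hypothetical choices made only for illustration.

```python
def step(dist, P):
    """One step of the no-regeneration dynamics: row vector times transition matrix."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_by_regeneration(psi, P, beta, terms=400):
    """Truncated series sum_k (1 - beta) * beta**k * (psi P^k)."""
    zeta = [0.0] * len(psi)
    dist_k = list(psi)      # surplus distribution k steps after a regeneration
    weight = 1.0 - beta     # P(tau = k) for k = 0
    for _ in range(terms):
        zeta = [z + weight * d for z, d in zip(zeta, dist_k)]
        dist_k = step(dist_k, P)
        weight *= beta
    return zeta


# Hypothetical 3-state surplus chain; new customers always enter in the middle state.
P = [[0.9, 0.1, 0.0],
     [0.8, 0.0, 0.2],
     [0.0, 0.7, 0.3]]
psi = [0.0, 1.0, 0.0]
beta = 0.92
zeta = stationary_by_regeneration(psi, P, beta)

# One step of the full kernel: regenerate w.p. (1 - beta), otherwise move by P.
moved = step(zeta, P)
zeta_next = [(1 - beta) * s + beta * m for s, m in zip(psi, moved)]
print(zeta, zeta_next)
```

Because the weights decay geometrically, truncating the series changes the result by at most beta raised to the number of terms kept.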


B.1. Existence of MFE 
Proof of Lemma 5.3

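Define the increasing and piecewise linear convex function

\[ g_\rho(y) = \max_{a\in\mathcal{A}} \phi(p_{\rho,a})\, y - \theta_a = \max_{\sigma\in\Delta(|\mathcal{A}|)} \sum_{a\in\mathcal{A}} \sigma_a \big(\phi(p_{\rho,a})\, y - \theta_a\big), \tag{23} \]

where \(\Delta(|\mathcal{A}|)\) is the probability simplex on \(|\mathcal{A}|\) elements.
By the properties of the lottery and the weight function, \(\phi(p_{\rho,a})\) is
continuous in \(\rho\) for all \(a \in \mathcal{A}\). Using Berge's maximum theorem, we
have that

\[ \arg\max_{\sigma\in\Delta(|\mathcal{A}|)} \sum_{a\in\mathcal{A}} \sigma_a \big(\phi(p_{\rho,a})\, y - \theta_a\big) \tag{24} \]

is upper semicontinuous in \(\rho\).
Now let

\[ \mathcal{A}(y) := \arg\max g_\rho(y) = \arg\max_{a\in\mathcal{A}} \phi(p_{\rho,a})\, y - \theta_a, \tag{25} \]

then the set-valued function above is exactly \(\Delta(|\mathcal{A}(y)|)\).

Hence, the optimal randomized policies at surplus \(x\) are a set-valued function
\(\sigma(x) = \Delta\big(|\mathcal{A}(V_\rho(x+w) - V_\rho(x-l))|\big)\), which is upper
semicontinuous due to the Lipschitz continuity of \(V_\rho(\cdot)\) in \(\rho\) and the
u.s.c. of \(\phi(p_{\rho,a})\) in \(\rho\), i.e., for every state \(x\), the action
distribution \(\sigma(x)\) is (pointwise) upper semicontinuous in \(\rho\).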
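
To illustrate the piecewise linear function and its set of maximizing actions, the sketch below evaluates the maximum over actions and returns every action attaining it. The distorted win probabilities `phi_p` and costs `theta` are hypothetical values chosen so that ties occur at round numbers.

```python
def best_actions(y, phi_p, theta, tol=1e-12):
    """Return (g(y), maximizers) for g(y) = max_a phi_p[a] * y - theta[a]."""
    values = [pp * y - th for pp, th in zip(phi_p, theta)]
    g = max(values)
    maximizers = [a for a, v in enumerate(values) if g - v <= tol]
    return g, maximizers


# Hypothetical distorted win probabilities and action costs for three actions.
phi_p = [0.0, 0.2, 0.5]
theta = [0.0, 1.0, 4.0]

g_low, set_low = best_actions(5.0, phi_p, theta)     # tie between actions 0 and 1
g_mid, set_mid = best_actions(10.0, phi_p, theta)    # tie between actions 1 and 2
g_high, set_high = best_actions(20.0, phi_p, theta)  # action 2 alone
print(set_low, set_mid, set_high)
```

At a tie, any mixture over the maximizing actions is optimal, which is why the optimal randomized policy is set-valued and only upper semicontinuous rather than continuous.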

Proof of Lemma 5.4

The existence and uniqueness of \(\zeta(x)\) for a given \(\rho\) and \(\sigma(x)\), and
the relationship between \(\zeta(\cdot)\) and \(\zeta^{(k)}(\cdot)\) are shown in Lemma
4.3. Now, we will prove the continuity of \(\zeta_{\rho\times\sigma}\) in \(\rho\) and
\(\sigma(x)\) for every surplus \(x \in X\). For the assumed action distribution \(\rho\)
on the finite set \(\mathcal{A}\), we consider the topology of pointwise convergence,
which is equivalent to uniform convergence by the strong coupling results in [Li et al.
2016b]. For the randomized action distribution \(\sigma\), corresponding to
\(\sigma(x)\) at each surplus \(x \in X\), we consider the topology with metric
\(\rho(\sigma^1, \sigma^2) = \min(\|\sigma^1(x_j) - \sigma^2(x_j)\|, 1)\), where
\(\|\cdot\|\) is any norm.

First, we will show that the surplus distribution \(\zeta_{\rho\times\sigma}\) is
continuous in \(\rho\) and \(\sigma\). By the Portmanteau theorem, we only need to show
that for any sequence \(\rho_n \to \rho\) uniformly, \(\sigma_n \to \sigma\) pointwise,
and any open set \(B\), we have
\(\liminf_{n\to\infty} \zeta_{\rho_n\times\sigma_n}(B) \ge \zeta_{\rho\times\sigma}(B)\).

Lemma B.1. \(\liminf_{n\to\infty} \zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid x) \ge \zeta^{(k)}_{\rho\times\sigma}(B \mid x)\).


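Proof of Lemma B.1

The proof proceeds by induction on \(k\). For \(k = 0\),
\(\zeta^{(0)}_{\rho_n\times\sigma_n}(B \mid x) = \mathbf{1}(x \in B)\) is a point-mass at
\(x\) irrespective of \(\rho_n \times \sigma_n\), and in fact, for any
\(n \in \mathbb{N}_+\), we have
\(\zeta^{(0)}_{\rho_n\times\sigma_n}(B \mid x) = \zeta^{(0)}_{\rho\times\sigma}(B \mid x)\).
Let \(\rho_n \to \rho\) uniformly, and \(\sigma_n(x) \to \sigma(x)\) pointwise for every
surplus \(x\). We will show that \(\zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid x)\)
converges pointwise to \(\zeta^{(k)}_{\rho\times\sigma}(B \mid x)\).

We will refer to the measure and random variables corresponding to
\(\rho_n \times \sigma_n\) as coming from the \(n\)th system and those corresponding to
\(\rho \times \sigma\) as coming from the limiting system. We will prove that
\(\zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid x)\) converges to
\(\zeta^{(k)}_{\rho\times\sigma}(B \mid x)\) pointwise using the metrics given above.

Suppose that the hypothesis holds true for \(k - 1\) where \(k \ge 1\), i.e.,
\(\zeta^{(k-1)}_{\rho_n\times\sigma_n}(B \mid x)\) converges pointwise to
\(\zeta^{(k-1)}_{\rho\times\sigma}(B \mid x)\). To prove this lemma, we only need to show
that the hypothesis holds for \(k\). Let \(P_{\rho\times\sigma,x}(\cdot)\) be the
one-step transition probability measure of the surplus dynamics conditioned on the
initial state of the surplus being \(x\) and on there being no regeneration. Then we have
\(P_{\rho_n\times\sigma_n,x}(x+w) = \sum_{a\in\sigma_n(x)} p_{\rho_n\times\sigma_n,a}\),
\(P_{\rho_n\times\sigma_n,x}(x-l) = 1 - \sum_{a\in\sigma_n(x)} p_{\rho_n\times\sigma_n,a}\),
\(P_{\rho\times\sigma,x}(x+w) = \sum_{a\in\sigma(x)} p_{\rho\times\sigma,a}\), and
\(P_{\rho\times\sigma,x}(x-l) = 1 - \sum_{a\in\sigma(x)} p_{\rho\times\sigma,a}\). By the
properties of the lottery, \(p_{\rho\times\sigma,a}\) is continuous in
\(\rho\times\sigma\) for all \(a \in \mathcal{A}\), thus \(p_{\rho_n\times\sigma_n,a}\)
converges to \(p_{\rho\times\sigma,a}\) pointwise, i.e.,
\(P_{\rho_n\times\sigma_n,x}(\cdot)\) converges to \(P_{\rho\times\sigma,x}(\cdot)\)
pointwise. By the Skorokhod representation theorem [Billingsley 2013], there exist random
variables \(X_n\) and \(X\) on a common probability space and a random integer \(N\) such
that \(X_n \sim P_{\rho_n\times\sigma_n,x}(\cdot)\) for all \(n \in \mathbb{N}\),
\(X \sim P_{\rho\times\sigma,x}(\cdot)\), and \(X_n = X\) for \(n \ge N\).

Then we have

\begin{align*}
\liminf_{n\to\infty} \zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid x)
 &= \liminf_{n\to\infty} \mathbb{E}\Big(\zeta^{(k-1)}_{\rho_n\times\sigma_n}(B \mid X_n)\Big)
  \ge \mathbb{E}\Big(\liminf_{n\to\infty} \zeta^{(k-1)}_{\rho_n\times\sigma_n}(B \mid X_n)\Big) \\
 &\ge \mathbb{E}\Big(\zeta^{(k-1)}_{\rho\times\sigma}(B \mid X)\Big)
  = \zeta^{(k)}_{\rho\times\sigma}(B \mid x), \tag{26}
\end{align*}

where the two inequalities hold due to Fatou's lemma and the induction hypothesis,
respectively. Hence, for a given \(\rho\) and randomized policies \(\sigma(x)\), the
unique stationary surplus distribution \(\zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid x)\)
converges pointwise to \(\zeta^{(k)}_{\rho\times\sigma}(B \mid x)\).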

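Now by Lemma 4.3 and Equation (22), we need to show that
\(\liminf_{n\to\infty} \zeta_{\rho_n\times\sigma_n}(B) \ge \zeta_{\rho\times\sigma}(B)\).
By Fatou's lemma, we have

\begin{align*}
\liminf_{n\to\infty} \zeta_{\rho_n\times\sigma_n}(B)
 &= \liminf_{n\to\infty} \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}_{\Psi}\Big(\zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid X)\Big) \\
 &\ge \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}_{\Psi}\Big(\liminf_{n\to\infty} \zeta^{(k)}_{\rho_n\times\sigma_n}(B \mid X)\Big) \\
 &\ge \sum_{k=0}^{\infty} (1-\beta)\beta^k\, \mathbb{E}_{\Psi}\Big(\zeta^{(k)}_{\rho\times\sigma}(B \mid X)\Big)
  = \zeta_{\rho\times\sigma}(B). \tag{27}
\end{align*}

Thus, for a given \(\rho\) and the randomized policies \(\sigma(x)\), the unique
stationary surplus distribution \(\zeta_{\rho_n\times\sigma_n}\) converges pointwise to
\(\zeta_{\rho\times\sigma}\). Then the stationary surplus distribution
\(\zeta_{\rho\times\sigma}\) is continuous in \(\rho\) and \(\sigma(x)\) for every
surplus \(x \in X\).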

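Proof of Lemma 5.5

Given the stationary surplus distribution \(\zeta(x)\) and the action distribution
\(\sigma(x)\) at every surplus \(x\), these induce a population profile based on the
actions chosen at each point \(x\). Denote that action distribution by \(\rho\); we have
\(\rho_a = \sum_{x\in X} \zeta(x)\,\sigma_a(x)\), where \(a \in \mathcal{A}\), \(X\) is
a countable set and \(\mathcal{A}\) is a finite set.

To show that \(\rho\) is continuous in \(\zeta(x)\) and \(\sigma(x)\), we only need to
show that for any sequence \(\{\zeta_n\}_{n=1}^{\infty}\) converging to \(\zeta\) in
uniform norm, and \(\{\sigma_n(x)\}_{n=1}^{\infty}\) converging to \(\sigma(x)\)
pointwise, we have that \(\{\rho_n\}_{n=1}^{\infty}\) converges to \(\rho\) pointwise,
which is equivalent to convergence in uniform norm as we have a finite set
\(\mathcal{A}\).

Since \(\zeta_n \to \zeta\) uniformly, we have \(\forall \epsilon_1 > 0\),
\(\exists N_1 \in \mathbb{N}\), so that \(\forall n \ge N_1\), \(\forall x \in X\),
\(|\zeta_n(x) - \zeta(x)| \le \epsilon_1\). Similarly, since
\(\{\sigma_n(x)\}_{n=1}^{\infty}\) converges to \(\sigma(x)\) pointwise, we have
\(\forall x \in X\) and \(\forall \epsilon_2 > 0\), \(\exists N_2 \in \mathbb{N}\) so
that \(\forall n \ge N_2\), \(|\sigma_n(x) - \sigma(x)| \le \epsilon_2\). Now consider
\(\epsilon = \max(\epsilon_1, \epsilon_2)\); we can find an all but finite subset
\(X_1\) of \(X\) such that \(\sum_{x\in X_1} \zeta(x) \le \frac{\epsilon}{2}\). Let
\(N = \max(N_1, N_2)\); for \(\forall x \in X \setminus X_1\), \(\exists n \ge N\) large
enough such that \(|\sigma_{n,a}(x) - \sigma_a(x)| \le \frac{\epsilon}{2}\). Then
\(\forall x \in X\), \(\forall a \in \mathcal{A}\), we have

\begin{align*}
|\rho_{n,a} - \rho_a|
 &= \Big|\sum_{x} \zeta_n(x)\sigma_{n,a}(x) - \sum_{x} \zeta(x)\sigma_a(x)\Big| \\
 &= \Big|\sum_{x} \zeta_n(x)\sigma_{n,a}(x) - \sum_{x} \zeta_n(x)\sigma_a(x) + \sum_{x} \zeta_n(x)\sigma_a(x) - \sum_{x} \zeta(x)\sigma_a(x)\Big| \\
 &\le \Big|\sum_{x} \zeta_n(x)\sigma_{n,a}(x) - \sum_{x} \zeta_n(x)\sigma_a(x)\Big| + \Big|\sum_{x} \zeta_n(x)\sigma_a(x) - \sum_{x} \zeta(x)\sigma_a(x)\Big| \\
 &\le \sum_{x} \zeta_n(x)\,|\sigma_{n,a}(x) - \sigma_a(x)| + \sum_{x} \sigma_a(x)\,|\zeta_n(x) - \zeta(x)| \\
 &= \sum_{x\in X_1} \zeta_n(x)\,|\sigma_{n,a}(x) - \sigma_a(x)| + \sum_{x\in X\setminus X_1} \zeta_n(x)\,|\sigma_{n,a}(x) - \sigma_a(x)| + \sum_{x} \sigma_a(x)\,|\zeta_n(x) - \zeta(x)| \\
 &\overset{(a)}{\le} \sum_{x\in X_1} \zeta_n(x) + \frac{\epsilon}{2} \sum_{x\in X\setminus X_1} \zeta_n(x) + \sum_{x} \sigma_a(x)\,|\zeta_n(x) - \zeta(x)| \\
 &\overset{(b)}{\le} \frac{\epsilon}{2} + \frac{\epsilon}{2}\cdot 1 + \epsilon_1 \cdot 1 \le \epsilon \cdot 1 + \epsilon \cdot 1 = 2\epsilon, \tag{28}
\end{align*}

where (a) follows from the fact that \(|\sigma_{n,a}(x) - \sigma_a(x)| \le 1\) for
\(\forall x \in X\), and \(|\sigma_{n,a}(x) - \sigma_a(x)| \le \frac{\epsilon}{2}\) for
\(x \in X \setminus X_1\) given \(\epsilon > 0\) and \(n\) large enough, and the
convergence of \(\zeta_n\); (b) follows from
\(\sum_{x\in X_1} \zeta(x) \le \frac{\epsilon}{2}\).

Therefore, \(|\rho_{n,a} - \rho_a| \le 2\epsilon\) for all \(a \in \mathcal{A}\) and
\(\forall n \ge N\), hence \(\rho_n \to \rho\) pointwise, which is equivalent to
convergence in uniform norm as we have a finite set \(\mathcal{A}\).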
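
The map from a surplus distribution and a randomized policy to the induced action distribution is a weighted sum over surplus values. A minimal sketch, with a hypothetical three-point surplus distribution `zeta` and policy `sigma`, is:

```python
def population_profile(zeta, sigma, n_actions):
    """rho[a] = sum over surplus x of zeta[x] * sigma[x][a]."""
    rho = [0.0] * n_actions
    for x, mass in zeta.items():
        for a in range(n_actions):
            rho[a] += mass * sigma[x][a]
    return rho


# Hypothetical surplus distribution and per-surplus action probabilities.
zeta = {-1: 0.25, 0: 0.5, 1: 0.25}
sigma = {
    -1: [1.0, 0.0, 0.0],   # low surplus: always action 0
    0: [0.0, 0.5, 0.5],    # zero surplus: mix actions 1 and 2
    1: [0.0, 0.0, 1.0],    # high surplus: always action 2
}
rho = population_profile(zeta, sigma, n_actions=3)
print(rho)
```

Since each entry of the result is a sum of products of the two inputs, small changes in either input produce small changes in the output, which is the continuity property the proof establishes for a countable state space.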

C. CHARACTERISTICS OF THE BEST RESPONSE POLICY 
Proof of Lemma 6.1

First, we consider \(x \in X\) and \(x \ge 0\). We have

\begin{align*}
 & u(x) - \theta_{a_2(x)} + \beta\big[p_{\rho,a_2}(x) V_\rho(x+w) + (1 - p_{\rho,a_2}(x)) V_\rho(x-l)\big] \\
 &\qquad \ge u(x) - \theta_{a_1(x)} + \beta\big[p_{\rho,a_1}(x) V_\rho(x+w) + (1 - p_{\rho,a_1}(x)) V_\rho(x-l)\big] \\
\iff\ & \theta_{a_1(x)} - \theta_{a_2(x)} \ge \beta\big[(p_{\rho,a_1}(x) - p_{\rho,a_2}(x)) V_\rho(x+w) \\
 &\qquad + \big((1 - p_{\rho,a_1}(x)) - (1 - p_{\rho,a_2}(x))\big) V_\rho(x-l)\big] \\
\iff\ & \theta_{a_1(x)} - \theta_{a_2(x)} \ge \beta\,(p_{\rho,a_1}(x) - p_{\rho,a_2}(x))\,\big[V_\rho(x+w) - V_\rho(x-l)\big]. \tag{29}
\end{align*}

As we assumed \(\theta_{a_1(x)} > \theta_{a_2(x)}\), it follows that
\(p_{\rho,a_1}(x) > p_{\rho,a_2}(x)\). Also, since \(w + l > 0\) and \(V_\rho(x)\) is
increasing in \(x\), both sides of the above inequality are non-negative. Since
\(V_\rho(x)\) is submodular when \(x \ge -l\), the RHS is a decreasing function of
\(x\). Let \(x^*_{a_1,a_2} \in X\) be the smallest value such that LHS \(\ge\) RHS. Then
for all \(x > x^*_{a_1,a_2}\) action \(a_2(x)\) is preferred to action \(a_1(x)\), and
for all \(x < x^*_{a_1,a_2}\) action \(a_1(x)\) is preferred to action \(a_2(x)\).
Finally, if LHS = RHS at \(x^*_{a_1,a_2}\), then at \(x^*_{a_1,a_2}\) the agent is
indifferent between the two actions, and if instead LHS > RHS, then action \(a_2(x)\) is
preferred to action \(a_1(x)\). We call \(x^*_{a_1,a_2}\) the threshold value of surplus
for actions \(a_1(x)\) and \(a_2(x)\).

Similarly, for \(x \in X\) and \(x < 0\), \(V_\rho(x)\) is supermodular when \(x \le w\),
which implies the existence of a threshold policy.
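
The threshold can be located by scanning for the first surplus at which the cost difference is at least the discounted difference in expected continuation value. In the sketch below the value function `V` (an increasing, concave square-root curve), the costs, the win probabilities and the integer surplus grid are all hypothetical; only the comparison itself follows (29).

```python
import math


def preference_gap(x, V, theta1, theta2, p1, p2, w, l, beta):
    """LHS - RHS of the comparison; non-negative means action 2 is weakly preferred."""
    lhs = theta1 - theta2
    rhs = beta * (p1 - p2) * (V(x + w) - V(x - l))
    return lhs - rhs


def threshold_surplus(V, theta1, theta2, p1, p2, w, l, beta, x_max=1000):
    """Smallest non-negative integer surplus with LHS >= RHS, or None if none is found."""
    for x in range(x_max + 1):
        if preference_gap(x, V, theta1, theta2, p1, p2, w, l, beta) >= 0:
            return x
    return None


def V(x):
    # Hypothetical increasing, concave value function.
    return 10.0 * math.sqrt(x + 5.0)


params = dict(theta1=2.0, theta2=1.0, p1=0.5, p2=0.2, w=3, l=1, beta=0.92)
x_star = threshold_surplus(V, **params)
print(x_star)
```

Because the value difference shrinks as the surplus grows, the gap changes sign only once, so the costlier action is chosen below the threshold and the cheaper one at or above it.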

Proof of Lemma 6.2

First, let \(f \in \Phi\), and suppose that \(f\) is an increasing and submodular
function. We first prove that \(T_\rho f\) is increasing and submodular too. Let
\(a^*(x)\) be an optimal action in the definition of \(T_\rho f(x)\) when the surplus is
\(x\), i.e., one of the maximizers. Let \(x_1 > x_2\), then

\begin{align*}
T_\rho f(x_1) - T_\rho f(x_2)
 &= u(x_1) - u(x_2) - \theta_{a^*(x_1)} + \theta_{a^*(x_2)} + \beta\big[p_{\rho,a^*(x_1)}(x_1) f(x_1+w) \\
 &\qquad + (1 - p_{\rho,a^*(x_1)}(x_1)) f(x_1-l) - p_{\rho,a^*(x_2)}(x_2) f(x_2+w) - (1 - p_{\rho,a^*(x_2)}(x_2)) f(x_2-l)\big] \\
 &\ge u(x_1) - u(x_2) - \theta_{a^*(x_2)} + \theta_{a^*(x_2)} + \beta\big[p_{\rho,a^*(x_2)}(x_2) f(x_1+w) \\
 &\qquad + (1 - p_{\rho,a^*(x_2)}(x_2)) f(x_1-l) - p_{\rho,a^*(x_2)}(x_2) f(x_2+w) - (1 - p_{\rho,a^*(x_2)}(x_2)) f(x_2-l)\big] \\
 &= u(x_1) - u(x_2) + \beta\big[p_{\rho,a^*(x_2)}(x_2)\big(f(x_1+w) - f(x_2+w)\big) \\
 &\qquad + (1 - p_{\rho,a^*(x_2)}(x_2))\big(f(x_1-l) - f(x_2-l)\big)\big] \ge 0.
\end{align*}

The first inequality holds because \(a^*(x_2)\) need not be an optimal action when the
surplus is \(x_1\).

Again, let \(x_1 > x_2\) and let \(x \ge 0\). Since \(u(\cdot)\) is a concave function,
it follows that it is submodular, i.e.,

\[ u(x_1+x) - u(x_1) \le u(x_2+x) - u(x_2) \iff u(x_1+x) + u(x_2) \le u(x_2+x) + u(x_1). \]

Assuming that \(f \in \Phi\) is submodular, we will now show that \(T_\rho f\) is also
submodular. Consider

\begin{align*}
T_\rho f(x_1+x) + T_\rho f(x_2)
 &= u(x_1+x) + u(x_2) - \theta_{a^*(x_1+x)} - \theta_{a^*(x_2)} \\
 &\quad + \beta\big[p_{\rho,a^*(x_1+x)}(x_1+x) f(x_1+x+w) + p_{\rho,a^*(x_2)}(x_2) f(x_2+w) \\
 &\quad + (1 - p_{\rho,a^*(x_1+x)}(x_1+x)) f(x_1+x-l) + (1 - p_{\rho,a^*(x_2)}(x_2)) f(x_2-l)\big].
\end{align*}

We assume without loss of generality that
\(p_{\rho,a^*(x_1+x)}(x_1+x) \ge p_{\rho,a^*(x_2)}(x_2)\) and let \(\delta\) be the
difference; if \(p_{\rho,a^*(x_1+x)}(x_1+x) < p_{\rho,a^*(x_2)}(x_2)\), then a similar
proof establishes the result. Using this we have the RHS (denoted by \(d\)) being

\begin{align*}
d &= u(x_1+x) + u(x_2) - \theta_{a^*(x_1+x)} - \theta_{a^*(x_2)} + \beta\big[p_{\rho,a^*(x_2)}(x_2)\big(f(x_1+x+w) + f(x_2+w)\big) \\
 &\quad + (1 - p_{\rho,a^*(x_1+x)}(x_1+x))\big(f(x_1+x-l) + f(x_2-l)\big) + \delta\big(f(x_1+x+w) + f(x_2-l)\big)\big].
\end{align*}

By submodularity of \(f(\cdot)\) we have

\begin{align*}
f(x_1+x+w) + f(x_2+w) &\le f(x_2+x+w) + f(x_1+w), \\
f(x_1+x-l) + f(x_2-l) &\le f(x_2+x-l) + f(x_1-l), \\
f(x_1+x+w) + f(x_2-l) &\le f(x_2+x+w) + f(x_1-l).
\end{align*}

With these and using the submodularity of \(u(\cdot)\) we get

\begin{align*}
d &\le u(x_2+x) + u(x_1) - \theta_{a^*(x_1+x)} - \theta_{a^*(x_2)} + \beta\big[p_{\rho,a^*(x_2)}(x_2)\big(f(x_2+x+w) + f(x_1+w)\big) \\
 &\quad + (1 - p_{\rho,a^*(x_1+x)}(x_1+x))\big(f(x_2+x-l) + f(x_1-l)\big) + \delta\big(f(x_2+x+w) + f(x_1-l)\big)\big] \\
 &= u(x_2+x) - \theta_{a^*(x_1+x)} + \beta\big[p_{\rho,a^*(x_1+x)}(x_1+x) f(x_2+x+w) + (1 - p_{\rho,a^*(x_1+x)}(x_1+x)) f(x_2+x-l)\big] \\
 &\quad + u(x_1) - \theta_{a^*(x_2)} + \beta\big[p_{\rho,a^*(x_2)}(x_2) f(x_1+w) + (1 - p_{\rho,a^*(x_2)}(x_2)) f(x_1-l)\big] \\
 &\le T_\rho f(x_2+x) + T_\rho f(x_1),
\end{align*}

where the last inequality holds because using the optimal actions
\((a^*(x_2+x), a^*(x_1))\) yields a higher value than the sub-optimal actions
\((a^*(x_1+x), a^*(x_2))\) when the surplus is \(x_2+x\) and \(x_1\).

Since both the monotonicity and submodularity properties are preserved when taking
pointwise limits, choosing \(f(\cdot) \equiv 0\) (or \(u(\cdot)\)) to start the value
iteration proves that the value function \(V_\rho(\cdot)\) is increasing and submodular.

Similarly, if \(f \in \Phi\) is an increasing and supermodular function, following the
same argument, we can prove that the value function \(V_\rho(\cdot)\) is increasing and
supermodular.